\documentclass[lettersize,journal]{IEEEtran}
\usepackage{amsmath,amsfonts}
\usepackage{array}
%\usepackage[caption=false,font=normalsize,labelfont=sf,textfont=sf]{subfig}
\usepackage{textcomp}
%\usepackage{stfloats}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{cite}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
% custom package
% for algorithm
\usepackage{algorithmic}
%\usepackage{algorithm}
\usepackage[linesnumbered,lined,ruled]{algorithm2e}
% for fig
\usepackage{makecell}
\ifCLASSOPTIONcompsoc
\usepackage[tight,normalsize,sf,SF]{subfigure}
\else
\usepackage[tight,footnotesize]{subfigure}
\fi
\begin{document}
\title{An I/O-Efficient Approach for Concurrent Spatio-Temporal Range Retrievals over Large-Scale Remote Sensing Image Data}
\author{Ao~Long,
Wei~Lin,
and Ze~Deng$^{\dagger}$
% <-this % stops a space
\IEEEcompsocitemizethanks{
\IEEEcompsocthanksitem A. Long, W. Lin, and Z. Deng (corresponding author: dengze@cug.edu.cn) are with the School of Computer Science, China University of Geosciences, Wuhan 430078, P. R. China.
\IEEEcompsocthanksitem Z. Deng is also with Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430074, China.
}% <-this % stops a space
\thanks{}
}
% The paper headers
\markboth{Journal of \LaTeX\ Class Files,~Vol.~14, No.~8, August~2021}%
{Long \MakeLowercase{\textit{et al.}}: An I/O-Efficient Approach for Concurrent Spatio-Temporal Range Retrievals over Large-Scale Remote Sensing Image Data}
%\IEEEpubid{0000--0000/00\$00.00~\copyright~2021 IEEE}
% Remember, if you use this you must call \IEEEpubidadjcol in the second
% column for its text to clear the IEEEpubid mark.
\maketitle
\begin{abstract}
High-performance remote sensing analytics workflows require ingesting and retrieving massive image archives to support real-time spatio-temporal applications. While modern systems utilize window-based I/O to reduce data transfer, they face a dual bottleneck: (1) the prohibitive overhead of runtime geospatial computations caused by the decoupling of logical indexing from physical storage, and (2) severe storage-level I/O contention triggered by uncoordinated concurrent reads. To address these limitations, we present a comprehensive I/O-aware retrieval approach based on a novel ``Index-as-an-Execution-Plan'' paradigm. We introduce a dual-layer inverted index that serves as an I/O planner, pre-materializing grid-to-pixel mappings to completely eliminate runtime geometric calculations. Furthermore, we design a hybrid concurrency-aware I/O coordination protocol that adaptively integrates Calvin-style deterministic ordering with optimistic execution, effectively converting I/O contention into request merging opportunities. To handle fluctuating workloads, we incorporate a Surrogate-Assisted Genetic Multi-Armed Bandit (SA-GMAB) for automatic parameter tuning. Experiments on a distributed cluster with Martian datasets indicate that: (1) I/O-aware indexing reduces retrieval latency by an order of magnitude; (2) hybrid concurrency-aware I/O coordination achieves a $54\times$ speedup under high contention through request merging and automates optimal mode switching; and (3) SA-GMAB converges fastest among the compared tuners and recovers from workload shifts $2\times$ faster than TunIO.
\end{abstract}
\begin{IEEEkeywords}
Remote sensing data management, Spatio-temporal range retrievals, I/O-aware indexing, Concurrency control, I/O tuning.
\end{IEEEkeywords}
\section{Introduction}
\IEEEPARstart{A}{} massive amount of remote sensing (RS) data, characterized by high spatial, temporal, and spectral resolutions, is being generated at an unprecedented speed due to the rapid advancement of Earth observation missions \cite{Ma15RS_bigdata}. For instance, NASA's AVIRIS-NG acquires nearly 9 GB of data per hour, while the EO-1 Hyperion sensor generates over 1.6 TB daily \cite{Haut21DDL_RS}. Beyond the sheer volume of data, these datasets are increasingly subjected to intensive concurrent access from global research communities and real-time emergency response systems (e.g., multi-departmental coordination during natural disasters). Consequently, modern RS platforms are required to provide not only massive storage capacity but also high-throughput retrieval capabilities to satisfy the simultaneous demands of numerous spatio-temporal analysis tasks.
Existing RS data management systems \cite{LEWIS17datacube, Yan21RS_manage1, liu24mstgi} typically decompose a spatio-temporal range retrieval into a decoupled two-phase execution model. The first phase is the metadata filtering phase, which utilizes spatio-temporal metadata (e.g., footprints, timestamps) to identify candidate image files that intersect the retrieval predicate. Recent advancements have transitioned from traditional tree-based indexes \cite{Strobl08PostGIS, Simoes16PostGIST} to scalable distributed schemes based on grid encodings and space-filling curves, such as GeoHash \cite{suwardi15geohash}, GeoSOT \cite{Yan21RS_manage1}, and GeoMesa \cite{hughes15geomesa, Li23TrajMesa}. By leveraging these high-dimensional indexing structures, the search complexity of the first phase has been effectively reduced to $O(\log N)$ or even $O(1)$, making metadata discovery extremely efficient even for billion-scale datasets.
The second phase is the data extraction phase, where the system reads the actual pixel data from the identified raw image files stored in distributed file systems or object stores. A critical observation in modern high-performance RS analytics is that the primary system bottleneck has fundamentally shifted from the first phase to the second. While the metadata search completes in milliseconds, the end-to-end retrieval latency is now dominated by the massive I/O overhead required to fetch, decompress, and process large-scale raw images. Traditional systems attempted to reduce I/O overhead by pre-slicing tiles and building pyramids (e.g., approaches used in Google Earth Engine \cite{gorelick17GEE} that store metadata in HBase and serve pre-tiled image pyramids), but aggressive tiling increases management complexity and produces many small files. More recent Cloud-Optimized GeoTIFF (COG) formats and COG-aware frameworks \cite{LEWIS17datacube}, \cite{riotiler25riotiler} exploit internal overviews and window-based I/O to minimize I/O access. Window-based I/O, also known as partial data retrieval, refers to the technique of reading only a specific spatial subset of a raster dataset rather than loading the entire file into memory. This is achieved by defining a window (bounding box) that specifies the pixel coordinates corresponding to the geographical area of interest.
While window-based I/O effectively reduces raw data transfer, it introduces a new computational burden due to the decoupling of logical indexing from physical storage. Current systems operate on a ``Search-then-Compute-then-Read" model: after identifying candidate files, they must perform fine-grained, per-image geospatial computations at runtime to map retrieval coordinates to precise file offsets and clip boundaries. This runtime geometric resolution becomes computationally prohibitive when processing a large volume of candidate images, often negating the benefits of I/O reduction. Moreover, under concurrent workloads, the lack of coordination among these independent read requests leads to severe I/O contention and storage thrashing, rendering traditional indexing-centric optimizations insufficient for real-time applications.
To address the aforementioned problems, we propose a novel ``Index-as-an-Execution-Plan" paradigm to bound the retrieval latency. Unlike conventional approaches that treat indexing and I/O execution as separate stages, our approach integrates fine-grained partial retrieval directly into the indexing structure. By pre-materializing the mapping between logical spatial grids and physical pixel windows, our system enables deterministic I/O planning without runtime geometric computation. To further ensure scalability, we introduce a concurrency control protocol tailored for spatio-temporal range retrievals and an automatic I/O tuning mechanism. The principal contributions of this paper are summarized as follows:
\begin{enumerate}
\item We propose an I/O-aware index schema. Instead of merely returning candidate image identifiers, our index directly translates high-level spatio-temporal predicates into concrete, byte-level windowed read plans. This design bridges the semantic gap between logical retrievals and physical storage, eliminating expensive runtime geospatial computations and ensuring that I/O cost is strictly proportional to the retrieval footprint.
\item We propose a hybrid concurrency-aware I/O coordination protocol. This protocol adapts transaction processing principles by integrating Calvin-style deterministic ordering \cite{Thomson12Calvin} with optimistic execution \cite{Lim17OCC}. It shifts the focus from protecting database rows to coordinating shared I/O flows. This protocol dynamically switches strategies based on spatial contention, effectively converting I/O contention into request merging opportunities.
\item We propose an automatic I/O tuning method to improve the I/O performance of spatio-temporal range retrievals over RS data. The method extends an existing AI-powered I/O tuning framework \cite{Rajesh24TunIO} based on a surrogate-assisted genetic multi-armed bandit algorithm \cite{Preil25GMAB}.
\end{enumerate}
The remainder of this paper is organized as follows:
Section~\ref{sec:RW} presents the related work.
Section~\ref{sec:DF} formulates the spatio-temporal range retrieval problem and establishes the cost models.
Section~\ref{sec:Overview} provides an overview of the proposed framework and describes how the three modules are integrated.
Section~\ref{sec:Index} presents the I/O-aware indexing structure.
Section~\ref{sec:CC} proposes the hybrid concurrency-aware I/O coordination protocol.
Section~\ref{sec:Tuning} presents the GMAB-based online I/O stack tuning method.
Section~\ref{sec:EXP} presents the experiments and results.
Section~\ref{sec:Con} concludes this paper with a summary.
\section{Related Work}\label{sec:RW}
This section describes the most salient studies of I/O-efficient spatio-temporal retrieval processing, concurrency control, and I/O performance tuning.
\subsection{I/O-Efficient Spatio-Temporal Retrieval Processing}
Efficient spatio-temporal retrieval for RS data has been extensively studied, with early efforts primarily focusing on metadata organization and index-level pruning in relational database systems. Traditional approaches typically extend tree-based spatial indexes, such as R-tree \cite{Strobl08PostGIS}, quadtree \cite{Tang12Quad-Tree}, and their spatio-temporal variants \cite{Simoes16PostGIST}, to organize image footprints together with temporal attributes, and are commonly implemented on relational backends (e.g., MySQL and PostgreSQL). These methods provide efficient range filtering for moderate-scale datasets, but their reliance on balanced tree structures often leads to high maintenance overhead and limited scalability as the volume of remote sensing metadata grows rapidly. With the continuous increase in data volume and ingestion rate, recent systems have gradually shifted toward grid-based spatio-temporal indexing schemes deployed on distributed NoSQL stores. By encoding spatial footprints into uniform spatial grids \cite{suwardi15geohash, Yan21RS_manage1} or space-filling curves \cite{liu24mstgi, Yang24GridMesa} and combining them with temporal identifiers, these approaches enable lightweight index construction and better horizontal scalability on backends such as HBase and Elasticsearch. Such grid-based indexes can effectively reduce the candidate search space through coarse-grained pruning and are more suitable for large-scale, continuously growing remote sensing archives.
However, index pruning alone is insufficient to guarantee end-to-end retrieval efficiency for remote sensing workloads, where individual images are usually large and retrieval results require further pixel-level processing. To reduce the amount of raw I/O, Google Earth Engine \cite{gorelick17GEE} relies on tiling and multi-resolution pyramids that physically split images into small blocks. More recent solutions leverage COG and window-based I/O to enable partial reads from monolithic image files; frameworks such as OpenDataCube \cite{LEWIS17datacube} exploit these features to read only the image regions intersecting a retrieval window, thereby reducing unnecessary data transfer. Nevertheless, after candidate images are identified, most systems still perform fine-grained geospatial computations for each image, including coordinate transformations and precise pixel-window derivation, which may incur substantial overhead when many images are involved. In this paper, we propose an I/O-aware index that pre-materializes grid-to-pixel mappings to eliminate runtime geometric calculations, enabling direct translation of spatio-temporal predicates into byte-level windowed read plans.
\subsection{Concurrency Control}
Concurrency control has long been studied to provide correctness and high throughput in multi-user database and storage systems, with two broad paradigms dominating the literature: deterministic scheduling \cite{Thomson12Calvin, hong2025deterministic} and non-deterministic schemes \cite{Bernstein812PL}, \cite{KungR81OCC}. Hybrid approaches \cite{WangK16MVOCC}, \cite{Hong25HDCC} that adaptively combine these paradigms seek to exploit the low-conflict efficiency of deterministic execution while retaining the flexibility of optimistic techniques. More recent proposals, such as OOCC \cite{Wu25OOCC}, target read-heavy, disaggregated settings by reducing validation and round-trips for read-only transactions, achieving low latency under OLTP-like workloads. These methods are primarily optimized for record- or key-level access patterns: their metrics and designs emphasize transaction latency, abort rates, and throughput under workloads with small, well-defined read/write sets.
Overall, existing concurrency control mechanisms are largely designed around transaction-level correctness and throughput, assuming record- or key-based access patterns and treating storage I/O as a black box. Their optimization objectives rarely account for I/O amplification or fine-grained storage contention induced by concurrent range retrievals. Consequently, these approaches are ill-suited for data-intensive spatio-temporal workloads, where coordinating overlapping window reads and mitigating storage-level interference are critical to achieving scalable performance under multi-user access. Unlike prior transaction-level concurrency control mechanisms, we adapt the hybrid deterministic-optimistic concept from HDCC \cite{Hong25HDCC} to the thread level, targeting I/O contention resolution for concurrent spatio-temporal range retrievals.
\subsection{I/O Performance Tuning in Storage Systems}
I/O performance tuning has been extensively studied in the context of HPC and data-intensive storage systems, where complex multi-layer I/O stacks expose a large number of tunable parameters. These parameters span different layers, including application-level I/O libraries, middleware, and underlying storage systems, and their interactions often lead to highly non-linear performance behaviors. As a result, manual tuning is time-consuming and error-prone, motivating a wide range of auto-tuning approaches \cite{Peng26IOsurvey}.
Several studies focus on improving the efficiency of the tuning pipeline itself by reformulating the search space or optimization objectives. Chen et al. \cite{Chen21Tuning1} proposed a meta multi-objectivization model that introduces auxiliary performance objectives to mitigate premature convergence to local optima. While such techniques can improve optimization robustness, they are largely domain-agnostic and do not explicitly account for the characteristics of I/O-intensive workloads. Other works, such as the contextual bandit-based approach by Bez et al. \cite{Bez20TuningLayer}, optimize specific layers of the I/O stack (e.g., I/O forwarding) by exploiting observed access patterns. However, these methods are primarily designed for administrator-level tuning and target isolated components rather than end-to-end application I/O behavior \cite{Yang22end-IO}.
User-level I/O tuning has also been explored, most notably by H5Tuner \cite{Behzad13HDF5}, which employs genetic algorithms to optimize the configuration of the HDF5 I/O library. Although effective for single-layer tuning, H5Tuner does not consider cross-layer interactions and lacks mechanisms for reducing tuning cost, such as configuration prioritization or early stopping.
More recently, TunIO \cite{Rajesh24TunIO} proposed an AI-powered I/O tuning framework that explicitly targets the growing configuration spaces of modern I/O stacks. TunIO integrates several advanced techniques, including I/O kernel extraction, smart selection of high-impact parameters, and reinforcement learning-driven early stopping, to balance tuning cost and performance gain across multiple layers. Despite its effectiveness, TunIO and related frameworks primarily focus on single-application or isolated workloads, assuming stable access patterns during tuning. Retrieval-level I/O behaviors, such as fine-grained window access induced by spatio-temporal range retrievals, as well as interference among concurrent users, are generally outside the scope of existing I/O tuning approaches \cite{Wang26RethinkingTuning}. In contrast, we employ the GMAB algorithm \cite{Preil25GMAB}, which introduces a memory mechanism into evolutionary algorithms to permanently preserve historical observations, thereby achieving fast convergence and rapid adaptation to dynamic workload shifts.
\section{Problem Formulation}\label{sec:DF}
This section formalizes the spatio-temporal range retrieval problem and establishes the cost models for retrieval execution. We assume a distributed storage environment where large-scale remote sensing images are stored as objects or files.
Definition 1 (Spatio-temporal Remote Sensing Image). A remote sensing image $R$ is defined as a tuple:
\vspace{-0.05in}
\begin{equation}
\label{eqn:pre_rs}
R=\langle id, \Omega, \mathcal{D}, t \rangle,
\end{equation}
where $id$ is the unique identifier; $\Omega = [0, W] \times [0, H]$ denotes the pixel coordinate space; $\mathcal{D}$ represents the raw pixel data; and $t$ is the temporal validity interval. The image is associated with a spatial footprint $MBR(R)$ in the global coordinate reference system.
Definition 2 (Spatio-temporal Range Retrieval). Given a dataset $\mathbb{R}$, a retrieval query $Q$ is defined by a spatio-temporal predicate $Q = \langle S, T \rangle$, where $S$ is the spatial bounding box and $T$ is the time interval. The retrieval result set $\mathcal{R}_Q$ is defined as:
\vspace{-0.05in}
\begin{equation}
\label{eqn:pre_st_query}
\mathcal{R}_Q=\left\{ R\in \mathbb{R}\mid MBR\left( R \right) \cap S\ne \emptyset \land R.t\cap T\ne \emptyset \right\}.
\end{equation}
For each $R \in \mathcal{R}_Q$, the system must return the pixel matrix corresponding to the intersection region $MBR(R) \cap S$.
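For illustration, the following minimal Python sketch evaluates the predicate of Eq.~(\ref{eqn:pre_st_query}) by brute force over an in-memory catalog; the field names (\texttt{mbr}, \texttt{t}, \texttt{path}) simply mirror the notation of Definitions 1 and 2 and are not part of any system API. The indexed execution described in Section~\ref{sec:Index} replaces this linear scan.
\begin{verbatim}
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Interval = Tuple[float, float]            # (t_start, t_end)

@dataclass
class RSImage:                 # Definition 1: <id, Omega, D, t>
    id: str
    mbr: BBox                  # spatial footprint MBR(R)
    t: Interval                # temporal validity interval
    path: str                  # pointer to the raw pixel data D

def bbox_intersects(a: BBox, b: BBox) -> bool:
    return not (a[2] < b[0] or b[2] < a[0]
                or a[3] < b[1] or b[3] < a[1])

def interval_intersects(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def range_retrieval(images: List[RSImage],
                    S: BBox, T: Interval) -> List[RSImage]:
    """Definition 2: images whose MBR meets S and time meets T."""
    return [R for R in images
            if bbox_intersects(R.mbr, S)
            and interval_intersects(R.t, T)]
\end{verbatim}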
Definition 3 (Retrieval Execution Cost Model). The execution latency of a retrieval $Q$, denoted as $Cost(Q)$, is composed of two phases: metadata filtering and data extraction.
\begin{equation}
\label{eqn:cost_total}
Cost\left( Q \right) =C_{meta}\left( Q \right) +\sum_{R\in \mathcal{R}_Q}{\left( C_{geo}\left( R,Q \right) +C_{io}\left( R,Q \right) \right)}.
\end{equation}
Here, $C_{meta}(Q)$ is the cost of identifying candidate images $\mathcal{R}_Q$ using indices. The data extraction cost for each image consists of two components: geospatial computation cost ($C_{geo}$) and I/O access cost ($C_{io}$). $C_{geo}$ is the CPU time required to calculate the pixel-to-geographic mapping, determine the exact read windows (offsets and lengths), and handle boundary clipping. In window-based partial reading schemes, this cost is non-negligible due to the complexity of coordinate transformations. $C_{io}$ is the latency to fetch the actual binary data from storage.
Definition 4 (Concurrent Spatio-temporal Retrievals). Let $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_N\}$ denote a set of spatio-temporal range retrievals issued concurrently by multiple users.
Each retrieval $Q_i$ independently specifies a spatio-temporal window $\langle S_i, T_i \rangle$ and may overlap with others in both spatial and temporal dimensions. Concurrent execution of $\mathcal{Q}$ may induce overlapping partial reads over the same images or image regions, leading to redundant I/O and storage-level contention if retrievals are processed independently.
\textbf{Problem Statement (Latency-Optimized Concurrent Retrieval Processing).} Given a dataset $\mathbb{R}$ and a concurrent workload $\mathcal{Q}$, the objective is to minimize the total execution latency:
\vspace{-0.05in}
\begin{equation}
\label{eqn_pre_objective}
\min \sum_{Q_i\in \mathcal{Q}} \bigl( C_{\text{meta}}(Q_i)+\sum_{R\in \mathcal{R}_{Q_i}} \bigl( C_{\text{geo}}(R,Q_i) + C_{\text{io}}(R,Q_i) \bigr) \bigr),
\end{equation}
subject to:
\begin{enumerate}
\item \textit{Correctness:} The returned data must strictly match the spatio-temporal predicate defined in Eq. (\ref{eqn:pre_st_query}).
\item \textit{Isolation:} Concurrent reads must effectively share I/O bandwidth without causing starvation or excessive thrashing.
\end{enumerate}
\section{System Overview}\label{sec:Overview}
\begin{figure}
\centering
\includegraphics[width=2.2in]{fig/overview.png}
\caption{The workflow for processing concurrent spatio-temporal range retrievals in the system}
\label{fig:overview}
\end{figure}
To address the challenges of storage-level I/O contention and expensive runtime computations, we propose a layered distributed retrieval framework. As illustrated in Fig. \ref{fig:overview}, the system architecture is composed of five primary processing components: (1) \emph{request interface}, (2) \emph{index manager}, (3) \emph{I/O coordinator}, (4) \emph{data loader}, and (5) \emph{adaptive tuner}.
The \emph{request interface} serves as the system entry point and is responsible for accepting concurrent spatio-temporal retrievals. The \emph{index manager} acts as the planner of the system, interacting with the metadata storage; it translates logical spatio-temporal predicates into physical storage locations using a dual-layer inverted index. The \emph{I/O coordinator} serves as the traffic control layer: it detects spatial overlaps among concurrent reading plans to identify potential I/O conflicts and applies the hybrid concurrency-aware protocol to reorder or merge conflicting requests. The \emph{data loader} interfaces with the distributed file system or object store to read the pixel data. Finally, the \emph{adaptive tuner} optimizes the execution parameters in the background.
\section{I/O-aware Indexing Structure}\label{sec:Index}
This section introduces the details of the indexing structure for spatio-temporal range retrieval over RS data.
\begin{figure*}[htb]
\centering
\includegraphics[width=0.90\textwidth]{fig/index.png}
\caption{Index schema design}
\label{fig:index}
\end{figure*}
\begin{figure}
\centering
\includegraphics[width=2.2in]{fig/st-query.png}
\caption{Retrieval-time Execution}
\label{fig_ST_Query}
\end{figure}
\subsection{Index schema design}
To enable I/O-efficient spatio-temporal query processing, we first decompose the global spatial domain into a uniform grid that serves as the basic unit for query pruning and data access coordination. Specifically, we adopt a fixed-resolution global tiling scheme based on the Web Mercator (or EPSG:4326) coordinate system, using zoom level 14 to partition the Earth's surface into fine-grained grid cells (experiments show that the zoom-level-14 grid yields the highest indexing efficiency, as discussed in Section~\ref{sec:Index_exp_3}). This resolution strikes a practical balance between spatial selectivity and index size: finer levels would significantly increase metadata volume and maintenance cost, while coarser levels would reduce pruning effectiveness and lead to unnecessary image I/O. At this scale, each grid cell typically corresponds to a spatial extent comparable to common query footprints and to the internal tiling granularity used by modern raster formats, making it well suited for partial data access.
\textbf{Grid-to-Image Mapping (G2I).}
Based on the grid decomposition, we construct a grid-centric inverted index to associate spatial units with covering images. In our system, each grid cell is assigned a unique \textit{GridKey}, encoded as a 64-bit Z-order value to preserve spatial locality and enable efficient range scans in key-value stores such as HBase. The G2I table stores one row per grid cell, where the row key is the GridKey and the value contains the list of image identifiers (ImageKeys) whose spatial footprints intersect the corresponding cell, as illustrated in Fig.~\ref{fig:index}(a).
This grid-to-image mapping allows retrieval processing to begin with a lightweight enumeration of grid cells covered by a retrieval region, followed by direct lookups of candidate images via exact GridKey matches. By treating each grid cell as an independent spatial bucket, the G2I table provides efficient metadata-level pruning and avoids costly geometric intersection tests over large image footprints.
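As a minimal, self-contained illustration of how a \textit{GridKey} can be derived (the exact bit layout of our implementation is not prescribed here), the sketch below maps a longitude/latitude pair to its zoom-level-14 tile indices using the standard slippy-map formula and interleaves the two indices into a 64-bit Z-order value:
\begin{verbatim}
import math

ZOOM = 14  # fixed global grid resolution used by the index

def lonlat_to_tile(lon, lat, zoom=ZOOM):
    """Slippy-map tile indices (x, y) of a WGS84 lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat)))
             / math.pi) / 2.0 * n)
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def _spread_bits(v):
    """Move bit i of the lower 32 bits of v to bit position 2*i."""
    v &= 0xFFFFFFFF
    v = (v | (v << 16)) & 0x0000FFFF0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F0F0F0F0F
    v = (v | (v << 2)) & 0x3333333333333333
    v = (v | (v << 1)) & 0x5555555555555555
    return v

def grid_key(x, y):
    """64-bit Z-order GridKey; nearby tiles get nearby keys."""
    return (_spread_bits(x) << 1) | _spread_bits(y)
\end{verbatim}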
However, the G2I table alone is insufficient for I/O-efficient retrieval execution. While it identifies which images are relevant to a given grid cell, it does not capture how the grid cell maps to pixel regions within each image. As a result, a grid-only representation cannot directly guide partial reads and would still require per-image geospatial computations at retrieval time. Therefore, the G2I table functions as a coarse spatial filter and must be complemented by an image-centric structure that materializes the correspondence between grid cells and pixel windows, enabling fine-grained, window-based I/O.
\textbf{Image-to-Grid Mapping (I2G).}
To complement the grid-centric G2I table and enable fine-grained, I/O-efficient data access, we introduce an image-centric inverted structure, referred to as the Image-to-Grid mapping (I2G). In contrast to G2I, which organizes metadata by spatial grids, the I2G table stores all grid-level access information of a remote sensing image in a single row. Each image therefore occupies exactly one row in the table, significantly improving locality during retrieval execution.
As illustrated in Fig.~\ref{fig:index}(b), the row key of the I2G table is the \textit{ImageKey}, i.e., the unique identifier of a remote sensing image. The row value is organized into three column families, each serving a distinct role in retrieval-time pruning and I/O coordination:
\textit{Grid-Window Mapping.}
This column family records the list of grid cells intersected by the image together with their corresponding pixel windows in the image coordinate space. Each entry has the form:
\begin{equation}
\label{eqn_pre_gridkey}
\langle \textit{GridKey}, W_{ImageKey\_GridKey} \rangle,
\end{equation}
where \textit{GridKey} identifies a grid cell at the chosen global resolution, and $W_{ImageKey\_GridKey}$ denotes the minimal pixel bounding rectangle within the image that exactly covers that grid cell.
These precomputed window offsets allow the retrieval executor to directly issue windowed reads on large raster files without loading entire images into memory or recomputing geographic-to-pixel transformations at retrieval time. As a result, grid cells become the smallest unit of coordinated I/O, enabling precise partial reads and effective elimination of redundant disk accesses.
\textit{Temporal Metadata.}
To support spatio-temporal range retrievals, each image row includes a lightweight temporal column family that stores its acquisition time information, such as the sensing timestamp or time interval. This metadata enables efficient temporal filtering to be performed jointly with spatial grid matching, without consulting external catalogs or secondary indexes.
\textit{Storage Pointer.}
This column family contains the information required to retrieve image data from the underlying storage system. It stores a stable file identifier, such as an object key in an object store (e.g., MinIO/S3) or an absolute path in a POSIX-compatible file system. By decoupling logical image identifiers from physical storage locations, this design supports flexible deployment across heterogeneous storage backends while allowing the retrieval engine to directly access image files once relevant pixel windows have been identified.
The I2G table offers several advantages. First, all grid-level access information for the same image is co-located in a single row, avoiding repeated random lookups and improving cache locality during retrieval execution. Second, by materializing grid-to-window correspondences at ingestion time, the system completely avoids expensive per-retrieval geometric computations and directly translates spatial overlap into byte-range I/O requests. Third, the number of rows in the I2G table scales with the number of images rather than the number of grid cells, substantially reducing metadata volume and maintenance overhead.
During data ingestion, the grid-window mappings are generated by projecting grid boundaries into the image coordinate system using the image's georeferencing parameters. This process requires only lightweight affine or RPC transformations and does not involve storing explicit geometries or performing polygon clipping. As a result, the I2G structure enables efficient partial reads while keeping metadata compact and ingestion costs manageable.
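A possible ingestion-time computation of a single grid-to-window entry is sketched below. It assumes, for simplicity, that the raster exposes an affine geotransform in the same longitude/latitude reference as the grid; reprojection, RPC handling, and nodata-aware clipping are omitted, and the rasterio calls are illustrative rather than prescriptive.
\begin{verbatim}
import math
import rasterio
from rasterio.windows import Window

def tile_bounds(x, y, zoom=14):
    """Lon/lat bounds (west, south, east, north) of tile (x, y)."""
    n = 2 ** zoom
    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    north = math.degrees(math.atan(math.sinh(
        math.pi * (1 - 2 * y / n))))
    south = math.degrees(math.atan(math.sinh(
        math.pi * (1 - 2 * (y + 1) / n))))
    return west, south, east, north

def grid_window(image_path, x, y):
    """Minimal pixel window of the image covering cell (x, y)."""
    west, south, east, north = tile_bounds(x, y)
    with rasterio.open(image_path) as src:
        inv = ~src.transform            # geographic -> pixel coords
        c0, r0 = inv * (west, north)    # upper-left corner
        c1, r1 = inv * (east, south)    # lower-right corner
        col0 = max(0, math.floor(c0))
        row0 = max(0, math.floor(r0))
        col1 = min(src.width, math.ceil(c1))
        row1 = min(src.height, math.ceil(r1))
        return Window(col0, row0, col1 - col0, row1 - row0)
\end{verbatim}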
\subsection{Retrieval-time Execution}
The I/O-aware index enables efficient spatio-temporal range retrievals by directly translating retrieval predicates into windowed read plans, while avoiding both full-image loading and expensive geometric computations. Given a user-specified spatio-temporal retrieval
$q = \langle [x_{\min}, y_{\min}, x_{\max}, y_{\max}], [t_s, t_e] \rangle$,
the system resolves the retrieval through three consecutive stages: \textit{Grid Enumeration}, \textit{Candidate Image Retrieval with Temporal Pruning}, and \textit{Windowed Read Plan Generation}. As illustrated in Fig.~\ref{fig_ST_Query}, this execution pipeline bridges high-level retrieval predicates and low-level I/O operations in a fully deterministic manner.
\textbf{Grid Enumeration.}
As shown in Step~1 and Step~2 of Fig.~\ref{fig_ST_Query}, the retrieval execution starts by rasterizing the spatial footprint of $q$ into the fixed global grid at zoom level 14. Instead of performing recursive space decomposition as in quadtrees or hierarchical spatial indexes, our system enumerates the minimal set of grid cells
$\{g_1, \ldots, g_k\}$
whose footprints intersect the retrieval bounding box.
Each grid cell corresponds to a unique 64-bit GridKey, which directly matches the primary key of the G2I table. This design has important implications: grid enumeration has constant depth and low computational cost, and the resulting GridKeys can be directly used as lookup keys without any geometric refinement. Consequently, spatial key generation is reduced to simple arithmetic operations on integer grid coordinates.
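Concretely, grid enumeration reduces to computing the tile-index range spanned by the query rectangle, as in the following sketch, which reuses the hypothetical \texttt{lonlat\_to\_tile} and \texttt{grid\_key} helpers from the earlier GridKey example:
\begin{verbatim}
def enumerate_grid_keys(xmin, ymin, xmax, ymax, zoom=14):
    """GridKeys of all zoom-14 cells intersecting the query box."""
    x0, y0 = lonlat_to_tile(xmin, ymax, zoom)  # upper-left tile
    x1, y1 = lonlat_to_tile(xmax, ymin, zoom)  # lower-right tile
    keys = []
    for x in range(min(x0, x1), max(x0, x1) + 1):
        for y in range(min(y0, y1), max(y0, y1) + 1):
            keys.append(grid_key(x, y))
    return keys
\end{verbatim}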
\textbf{Candidate Image Retrieval with Temporal Pruning.}
Given the enumerated grid set $\{g_1, \ldots, g_k\}$, the retrieval processor performs a batched multi-get on the G2I table. Each G2I row corresponds to a single grid cell and stores the identifiers of all images whose spatial footprints intersect that cell. For each grid $g_i$, the lookup returns:
\begin{equation}
\label{eqn_pre_lookup_return}
G2I[g_i] = \{ imgKey_1, \ldots, imgKey_m \}.
\end{equation}
All retrieved image identifiers are unioned to form the spatial candidate set
$C_s = \bigcup_{i=1}^{k} G2I[g_i]$.
This step eliminates the need for per-image polygon intersection tests that are commonly required in spatial databases and data cube systems.
To incorporate the temporal constraint $[t_s, t_e]$, each candidate image in $C_s$ is further filtered using the temporal column family of the Image-to-Grid (I2G) table. Images whose acquisition time does not intersect the retrieval interval are discarded early, yielding the final candidate set $C$. This lightweight temporal pruning is performed without accessing any image data and introduces negligible overhead.
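Assuming a thin key-value client with batched row lookups (the \texttt{multi\_get} and \texttt{get\_temporal} calls below are placeholders rather than a specific HBase API), this stage can be sketched as follows:
\begin{verbatim}
def candidate_images(g2i, i2g, grid_keys, t_start, t_end):
    """Union G2I postings over the query grids, then prune by time.

    g2i.multi_get(keys)       -> {grid_key: [image_key, ...]}
    i2g.get_temporal(img_key) -> (acq_start, acq_end)
    Both handles are hypothetical store wrappers.
    """
    # Spatial candidate set C_s: union of G2I[g_i] over all grids.
    postings = g2i.multi_get(grid_keys)
    c_s = set()
    for image_keys in postings.values():
        c_s.update(image_keys)

    # Temporal pruning against [t_start, t_end] via I2G metadata.
    candidates = []
    for img_key in c_s:
        acq_start, acq_end = i2g.get_temporal(img_key)
        if acq_start <= t_end and t_start <= acq_end:
            candidates.append(img_key)
    return candidates
\end{verbatim}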
\textbf{Windowed Read Plan Generation.}
As shown in Step~3 of Fig.~\ref{fig_ST_Query}, the final stage translates the candidate image set into a concrete I/O plan. For each image $I \in C$, the retrieval executor issues a selective range-get on the I2G table to retrieve only the grid-window mappings relevant to the retrieval grids:
\begin{equation}
\label{eqn_pre_spatial_query}
I2G\left[ I,\{g_1,...,g_k\} \right] =\left\{ W_{I\_g_i}\mid g_i\cap I\ne \emptyset \right\} .
\end{equation}
Each $W_{I\_g_i}$ specifies the exact pixel window in the original raster file that corresponds to grid cell $g_i$. Since these window offsets are precomputed during ingestion, retrieval execution requires only key-based lookups and arithmetic filtering. No geographic coordinate transformation, polygon clipping, or raster-vector intersection is performed at retrieval time.
The resulting collection of pixel windows constitutes a windowed read plan, which can be directly translated into byte-range I/O requests against the storage backend. This approach avoids loading entire scenes and ensures that the total I/O volume is proportional to the retrieved spatial extent rather than the image size.
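As an illustration, executing such a plan with a COG-aware raster library might look as follows; the \texttt{plan} argument is assumed to be a list of (image path, pixel window) pairs assembled from the I2G lookups, and the rasterio calls stand in for whatever windowed-read API the deployment uses.
\begin{verbatim}
import rasterio

def execute_read_plan(plan):
    """Issue windowed reads for (image_path, window) pairs.

    Only the byte ranges overlapping each pre-computed window are
    fetched, so total I/O tracks the query footprint, not image size.
    """
    results = []
    for image_path, window in plan:
        with rasterio.open(image_path) as src:
            pixels = src.read(window=window)  # all bands, window only
            results.append((image_path, window, pixels))
    return results
\end{verbatim}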
\subsection{Why I/O-aware}
The key reason our indexing design is I/O-aware lies in the fact that the index lookup results are not merely candidate identifiers, but constitute a concrete I/O access plan. Unlike traditional spatial indexes, where retrieval processing yields a set of objects that must still be fetched through opaque storage accesses, our Grid-to-Image and Image-to-Grid lookups deterministically produce the exact pixel windows to be read from disk. As a result, the logical retrieval plan and the physical I/O plan are tightly coupled: resolving a spatio-temporal predicate directly specifies which byte ranges should be accessed and which can be skipped.
This tight coupling fundamentally changes the optimization objective. Instead of minimizing index traversal cost or result-set size, the system explicitly minimizes data movement by ensuring that disk I/O is proportional to the retrieval's spatio-temporal footprint. Consequently, the index serves as an execution-aware abstraction that bridges retrieval semantics and storage behavior, enabling predictable, bounded I/O under both single-retrieval and concurrent workloads.
\textbf{Theoretical Cost Analysis.}
To rigorously quantify the performance advantage, we revisit the retrieval cost model defined in Eq. (\ref{eqn:cost_total}). In traditional full-image reading systems, although the geospatial computation cost is negligible ($C_{geo} = 0$) as no clipping is performed, the I/O cost $C_{io}$ is determined by the full file size. Consequently, the total latency is entirely dominated by massive I/O overhead, rendering $C_{meta}$ (typically milliseconds) irrelevant.
Existing window-based I/O systems successfully reduce the I/O cost to the size of the requested window. However, this reduction comes at the expense of a significant surge in $C_{geo}$. For every candidate image, the system must perform on-the-fly coordinate transformations and polygon clipping to calculate read offsets. When a retrieval involves thousands of images, the accumulated CPU time ($\sum C_{geo}$) becomes a new bottleneck (e.g., hundreds of milliseconds to seconds), often negating the benefits of I/O reduction (detailed quantitative comparisons are provided in Sec.~\ref{sec:Index_exp_2}).
In contrast, our I/O-aware indexing approach fundamentally alters this trade-off. By materializing the grid-to-pixel mapping in the I2G table, we effectively shift the computational burden from retrieval time to ingestion time. Although the two-phase lookup (G2I and I2G) introduces a slight overhead compared to simple tree traversals, $C_{meta}$ remains in the order of milliseconds—orders of magnitude smaller than disk I/O latency. Since the precise pixel windows are pre-calculated and stored, the runtime geospatial computation is effectively eliminated, i.e., $C_{geo} = 0$. The system retains the minimal I/O cost characteristic of window-based approaches, fetching only relevant byte ranges. Therefore, our design achieves the theoretical minimum for both computation and I/O components within the retrieval execution critical path.
\section{Hybrid Concurrency-Aware I/O Coordination}\label{sec:CC}
In this section, we propose a hybrid coordination mechanism that adaptively employs either non-deterministic execution or deterministic coordinated scheduling based on the real-time contention level of spatio-temporal workloads. Fig.~\ref{fig:cc} provides an overview of the entire coordination workflow.
\begin{figure*}
\centering
\includegraphics[width=0.90\textwidth]{fig/cc.png}
\caption{Hybrid Concurrency-Aware I/O Coordination.}
\label{fig:cc}
\end{figure*}
\subsection{Retrieval Admission and I/O Plan Generation}
When a spatio-temporal range retrieval $Q$ arrives, the system first performs index-driven plan generation. The retrieval footprint is rasterized into the global grid to enumerate the intersecting grid cells. The G2I table is then consulted to retrieve the set of candidate images, followed by selective lookups in the I2G table to obtain the corresponding pixel windows.
As a result, each retrieval is translated into an explicit \textit{I/O access plan} consisting of image-window pairs:
\vspace{-0.05in}
\begin{equation}
\label{eq:io_plan}
Plan\left( Q \right) =\left\{ \left( img_1,w_1 \right) ,\left( img_1,w_2 \right) ,\left( img_3,w_5 \right) ,... \right\},
\end{equation}
where each window $w$ denotes a concrete pixel range to be accessed via byte-range I/O. Upon admission, the system assigns each retrieval a unique \textit{RetrievalID} and records its arrival timestamp.
\subsection{Contention Estimation and Path Selection}
To minimize the overhead of global ordering in low-contention scenarios, the system introduces a Contention-Aware Switch. Upon the arrival of a retrieval batch $\mathcal{Q} = \{Q_1, Q_2, ..., Q_n\}$, the system first estimates the Spatial Overlap Ratio ($\sigma$) among their generated I/O plans.
Let $A(Plan(Q_i))$ be the aggregate spatial area of all pixel windows in the I/O plan of retrieval $Q_i$. The overlap ratio $\sigma$ for a batch is defined as:
\vspace{-0.05in}
\begin{equation}
\vspace{-0.05in}
\label{eqn:overlap_ratio}
\sigma = 1 - \frac{\text{A}(\bigcup_{i=1}^n Plan(Q_i))}{\sum_{i=1}^n \text{A}(Plan(Q_i))},
\end{equation}
where $\sigma \in [0, 1]$. A high $\sigma$ indicates that multiple retrievals are competing for the same image regions, leading to high I/O amplification if executed independently.
The system utilizes a rule-based assignment mechanism similar to HDCC \cite{Hong25HDCC} to select the execution path:
\begin{enumerate}
\item Path A (Non-deterministic/OCC-style): If $\sigma < \tau$ (where $\tau$ is a configurable threshold), retrievals proceed directly to execution to maximize concurrency.
\item Path B (Deterministic/Calvin-style): If $\sigma \ge \tau$, retrievals are routed to the Global I/O Plan Queue for coordinated merging.
\end{enumerate}
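A simplified version of this contention-aware switch is sketched below. Because I/O plans are grid-aligned, the union area in the overlap ratio $\sigma$ can be approximated by summing the areas of distinct (image, grid cell) entries; the default value of \texttt{tau} here is arbitrary and would be set by the adaptive tuner in practice.
\begin{verbatim}
def overlap_ratio(plans):
    """sigma = 1 - A(union of plans) / sum of A(plan_i).

    Each plan is a list of (image_key, grid_key, pixel_area)
    triples; grid alignment lets the union be approximated by
    the set of distinct (image_key, grid_key) entries.
    """
    total = 0.0
    union = {}
    for plan in plans:
        for image_key, grid_key, area in plan:
            total += area
            union[(image_key, grid_key)] = area
    if total == 0:
        return 0.0
    return 1.0 - sum(union.values()) / total

def choose_path(plans, tau=0.3):   # tau: illustrative default only
    """OCC-style path below tau, Calvin-style path at or above."""
    sigma = overlap_ratio(plans)
    return ("deterministic" if sigma >= tau
            else "optimistic"), sigma
\end{verbatim}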
\subsection{Deterministic Coordinated and Non-deterministic Execution}
When $\sigma \ge \tau$, the system switches to a deterministic path to mitigate storage-level contention and I/O amplification, as shown in Fig.~\ref{fig:cc}. To coordinate concurrent access to shared storage resources, we introduce a Global I/O Plan Queue that enforces a globally consistent ordering over all admitted I/O plans. Each windowed access $(img, w)$ derived from incoming retrievals is inserted into this queue according to a predefined policy, such as FIFO based on arrival time or lexicographic ordering by $(timestamp, RetrievalID)$. This design is inspired by deterministic scheduling in systems such as Calvin~\cite{Thomson12Calvin}, but differs fundamentally in scope: ordering is imposed on window-level I/O operations rather than on transactions. As a result, accesses to the same image region across different retrievals follow a globally consistent order, preventing uncontrolled interleaving and reducing storage-layer contention. The deterministic ordering also provides a stable foundation for subsequent I/O coordination and sharing.
The coordination mechanism operates through three stages within each scheduling interval:
\emph{Stage 1: Global De-duplication.} The system extracts all windowed access pairs $(img, w)$ from admitted retrievals and inserts them into a global window set $\mathcal{W}_{total}$. If multiple retrievals request the same pixel window $w$ from image $img$, only one unique entry is retained, preventing redundant retrieval of overlapping spatial grids.
\emph{Stage 2: Range Merging.} After de-duplication, the system analyzes physical disk offsets of all unique windows in $\mathcal{W}_{total}$. Windows that are physically contiguous or separated by a gap smaller than threshold $\theta$ are merged into a single read operation, improving access locality and reducing IOPS.
\emph{Stage 3: Dispatching.} The dispatcher maintains mappings between physical byte offsets in the shared buffer and logical window requirements of each retrieval. Each retrieval $Q_i$ receives only the exact pixel windows $w \in Plan(Q_i)$ it originally requested, achieved via zero-copy memory mapping or buffer slicing. This ensures that while physical I/O is shared to reduce amplification, logical execution of each retrieval remains independent and free from irrelevant data interference.
For example, when $Q_1$ requests grids $\{1, 2\}$ and $Q_2$ requests grids $\{2, 3\}$, Stage 1 identifies the unique requirement set $\{1, 2, 3\}$. Stage 2 merges these into a single contiguous I/O operation covering $[1, 3]$. In Stage 3, the dispatcher extracts memory offsets for grids $1$ and $2$ and delivers them to $Q_1$, while extracting and delivering grids $2$ and $3$ to $Q_2$. Through these mechanisms, concurrent retrievals collaboratively share I/O without requiring transaction-level synchronization.
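The worked example above can be expressed directly in code. The sketch below performs Stage~1 (global de-duplication) and Stage~2 (gap-aware merging) over abstract byte offsets, and returns the per-retrieval dispatch map of Stage~3; \texttt{theta} is the merge-gap threshold from the text, and the request format is illustrative.
\begin{verbatim}
def coordinate_batch(requests, theta):
    """requests: (retrieval_id, image_key, offset, length) tuples.

    Returns (merged_reads, dispatch): merged byte-range reads per
    image and, per retrieval, the unique windows it must receive.
    """
    # Stage 1: de-duplicate (image, offset, length) windows globally.
    unique = {}                        # (img, off, len) -> [rids]
    for rid, img, off, length in requests:
        unique.setdefault((img, off, length), []).append(rid)

    # Stage 2: merge windows on the same image with gap <= theta.
    per_image = {}
    for (img, off, length) in unique:
        per_image.setdefault(img, []).append((off, length))
    merged_reads = {}
    for img, wins in per_image.items():
        wins.sort()
        merged = [list(wins[0])]
        for off, length in wins[1:]:
            last = merged[-1]
            if off - (last[0] + last[1]) <= theta:
                last[1] = max(last[1], off + length - last[0])
            else:
                merged.append([off, length])
        merged_reads[img] = [tuple(m) for m in merged]

    # Stage 3: each retrieval gets only the windows it asked for.
    dispatch = {}
    for (img, off, length), rids in unique.items():
        for rid in rids:
            dispatch.setdefault(rid, []).append((img, off, length))
    return merged_reads, dispatch
\end{verbatim}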
When contention remains below the threshold ($\sigma < \tau$), the system prioritizes low latency by adopting optimistic dispatch, as shown in Fig.~\ref{fig:cc}. Instead of incurring coordination overhead, I/O plans are immediately offloaded to the execution engine with thread-local sublists, each independently handling its byte-range requests. This adaptive switching ensures that the system operates optimistically under low contention while engaging deterministic coordination to prevent thrashing when overlap becomes significant.
\subsection{Optimistic Read Execution and Completion}
Once a coordinated window read is scheduled, the system issues the corresponding byte-range I/O request immediately. Read execution is fully optimistic: there is no validation phase, no abort, and no rollback. This is enabled by the immutability of remote-sensing imagery and by the deterministic ordering of I/O plans, which together ensure consistent and repeatable read behavior.
A retrieval is considered complete when all windows in its I/O plan have been served and the associated local processing (e.g., reprojection or mosaicking) has finished. By eliminating validation overhead and allowing read execution to proceed independently once scheduled, the system achieves low-latency retrieval completion while maintaining predictable I/O behavior under concurrency.
Overall, this concurrency-aware I/O coordination mechanism reinterprets concurrency control as a problem of coordinating shared I/O flows. By operating at the granularity of windowed reads and leveraging deterministic ordering and optimistic execution, it effectively reduces redundant I/O and improves scalability for multi-user spatio-temporal retrieval workloads.
\subsection{Why Coordination Works}
\begin{figure}
\centering
\includegraphics[width=3.0in]{fig/unlock.png}
\caption{Illustration of retrieval execution models for overlapping spatio-temporal range queries}
\label{fig:unlock}
\end{figure}
Fig.~\ref{fig:unlock} illustrates the fundamental distinction between non-deterministic and deterministic execution models, demonstrating how our coordination mechanism ``unlocks'' concurrent I/O performance under high contention.
As shown in Fig.~\ref{fig:unlock}(a), when multiple overlapping retrievals $Q_1, Q_2, Q_3$ execute independently without coordination, they compete for the same underlying storage resources. This uncoordinated access pattern leads to severe I/O contention: $Q_2$ and $Q_3$ must wait for $Q_1$ to release locks on shared image regions, resulting in unpredictable serialization and degraded throughput. More critically, each retrieval independently performs identical geospatial computations and issues separate read requests for overlapping spatial regions. The timeline on the right of Fig.~\ref{fig:unlock}(a) reveals the consequences: redundant disk seeks, repeated lock acquisitions, and duplicated data transfers. This execution model suffers computational redundancy caused by repeated coordinate transformations, and I/O amplification due to multiple physical reads of the same byte ranges. Under high concurrency, this uncoordinated thrashing leads to super-linear latency growth.
Fig.~\ref{fig:unlock}(b) illustrates how our deterministic coordination transforms competitive I/O into collaborative I/O. Instead of treating retrievals as isolated transactions, the system admits overlapping retrievals as a unified batch (enclosed by the red bounding box) and enforces a deterministic execution order. By merging overlapping window requests from $Q_1, Q_2, Q_3$ into a single globally-ordered I/O plan, the system converts $N$ concurrent random reads into one sequential scan. As shown in the timeline of Fig.~\ref{fig:unlock}(b), the coordinated approach requires only one lock acquisition and one contiguous disk read, eliminating seek overhead and reducing rotational latency on spinning disks. This serialization advantage is particularly significant for parallel file systems, where small random reads suffer from metadata amplification. Moreover, the deterministic scheduler analyzes all admitted retrievals and identifies overlapping spatial windows before issuing any I/O. When multiple retrievals request the same pixel region, the system retains only one unique entry in the global I/O plan, effectively collapsing $N$ logical requests into $k$ physical reads with $1 \le k \le N$.
\section{I/O Stack Tuning}\label{sec:Tuning}
We first describe the I/O stack tuning problem and then propose the surrogate-assisted GMAB algorithm to solve the problem.
\subsection{Formulation of Online I/O Tuning}
We study a concurrent spatio-temporal retrieval engine that processes many range retrievals simultaneously. The system operates on large remote sensing images stored in shared storage. Unlike traditional HPC jobs or single-application I/O workloads, the system does not run one fixed job. Instead, it continuously receives a stream of user retrievals. Each retrieval is turned into many small I/O operations that often touch overlapping regions in large raster files.
Let $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_N\}$ denote a stream of spatio-temporal range retrievals submitted by multiple users. Each retrieval $q$ is decomposed by the I/O-aware index into a set of grid-aligned spatial windows based on a predefined global grid system. These windows are further mapped to sub-regions of one or more large remote sensing images. In this way, every retrieval produces an I/O execution context $c= \langle W,M,S \rangle$, where $W$ describes the set of image windows to be accessed, including their sizes, spatial overlap, and distribution across images; $M$ captures window-level coordination opportunities, such as window merging, deduplication, or shared reads across concurrent retrievals; and $S$ represents system-level execution decisions, including batching strategies, I/O scheduling order, and concurrency limits. Importantly, the I/O behavior of the system is not determined solely by static application code, but emerges dynamically from the interaction between retrieval workloads, execution plans, and system policies.
The goal of I/O tuning in this system is to optimize the performance of retrieval-induced I/O execution under continuous, concurrent workloads. We focus on minimizing the observed I/O cost per retrieval, which may be measured by metrics such as average retrieval latency, effective I/O throughput, or amortized disk read time. Let $\theta \in \varTheta$ denote a tuning configuration, where each configuration specifies a combination of system-level I/O control parameters, including window batching size, merge thresholds, queue depth, concurrency limits, and selected storage-level parameters exposed to the engine. Unlike traditional I/O tuning frameworks, the decision variables $\theta$ are applied at the retrieval execution level, rather than at application startup or compilation time.
For a given tuning configuration $\theta$ and execution context $c$, the observed I/O performance is inherently stochastic due to interference among concurrent retrievals, shared storage contention, and variability in window overlap and access locality. We model the observed performance outcome as a random variable:
\vspace{-0.05in}
\begin{equation}
\vspace{-0.05in}
\label{eqn:perf_outcome}
Y\left( \theta ,c \right) =f\left( \theta ,c \right) +\epsilon ,
\end{equation}
where $f\left( \cdot \right) $ is an unknown performance function and $\epsilon$ captures stochastic noise. Moreover, as retrieval workloads evolve over time, the distribution of execution contexts $c$ may change, making the tuning problem non-stationary.
Given a stream of retrievals $\mathcal{Q}$ and the resulting sequence of execution contexts $\left\{ c_t \right\} $, the problem is to design an online tuning strategy that adaptively selects tuning configurations $\theta _t$ for retrieval execution, so as to minimize the long-term expected I/O cost:
\vspace{-0.05in}
\begin{equation}
\vspace{-0.05in}
\label{eqn_tuning_table}
\min_{\left\{ \theta _t \right\}}\mathbb{E}\left[ \sum_{t=1}^T{Y}\left( \theta _t,c_t \right) \right] ,
\end{equation}
subject to practical constraints on tuning overhead and system stability.
\subsection{Surrogate-Assisted GMAB for Online I/O Tuning}
\begin{algorithm}[!htb]
\caption{Surrogate-Assisted Genetic Multi-Armed Bandit (SA-GMAB)}
\label{alg:sa-gmab}
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
\Input{Configuration space $\Theta$, Initial population size $P$, Exploration parameter $\alpha$, Surrogate update interval $\Delta$}
\Output{Online selection of I/O coordination configuration $\theta_t$}
\BlankLine
\tcp{Initialization}
Initialize memory table $\mathcal{M} = \emptyset$\;
Initialize surrogate model $\tilde{f}$ with empty training data\;
Generate an initial population $\mathcal{P}_0 \subset \Theta$\;
Set tuning step counter $t \leftarrow 0$\;
\BlankLine
\tcp{Online Tuning Loop}
\While{arrival of retrieval $q_t$ with execution context $c_t$}{
\tcp{Candidate Generation}
Apply genetic operators (selection, crossover, mutation) on current population to generate candidate set $\mathcal{C}_t \subset \Theta$\;
\tcp{Surrogate-based Pre-evaluation}
\ForEach{$\theta \in \mathcal{C}_t$}{
$\hat{r}_\theta \leftarrow \tilde{f}(\theta, c_t)$\;
}
\tcp{Candidate Filtering}
Select top-$K$ configurations $\mathcal{C}'_t \subset \mathcal{C}_t$ based on $\hat{r}_\theta$ or uncertainty\;
\tcp{Bandit-based Selection}
\ForEach{$\theta \in \mathcal{C}'_t$}{
$\text{Score}(\theta) = \hat{\mu}_\theta + \alpha \sqrt{\frac{\log(t+1)}{n_\theta + 1}}$\;
}
Select configuration: $\theta_t = \arg\max_{\theta \in \mathcal{C}'_t} \text{Score}(\theta)$\;
\tcp{Retrieval Execution \& Reward Observation}
Execute retrieval $q_t$ using I/O coordination policy $\theta_t$\;
Measure performance outcome and compute reward $r_t$\;
\tcp{State Update}
Update memory entry for $\theta_t$: $n_{\theta_t} \leftarrow n_{\theta_t} + 1$\;
$\hat{\mu}_{\theta_t} \leftarrow \hat{\mu}_{\theta_t} + \frac{r_t - \hat{\mu}_{\theta_t}}{n_{\theta_t}}$\;
Update population $\mathcal{P}$ by inserting $\theta_t$ (optionally evicting low-performing ones)\;
\If{$t \bmod \Delta = 0$}{
Retrain surrogate model $\tilde{f}$ using observations in $\mathcal{M}$\;
}
$t \leftarrow t + 1$\;
}
\end{algorithm}
To address the online I/O tuning problem, we use a Surrogate-Assisted Genetic Multi-Armed Bandit (SA-GMAB) framework. It combines genetic search, bandit-style exploration, and a simple performance model. The goal is to handle workloads where behavior changes over time, where results are random, and where retrievals may affect each other. The main steps of this framework are shown in Algorithm~\ref{alg:sa-gmab}.
We first initialize the memory table and the surrogate model, and then generate an initial population of configurations (lines 1--4). In our system, each arm is an I/O tuning configuration $\theta \in \varTheta$. A configuration is a group of I/O control parameters, such as merge thresholds, batch size, queue depth, and limits on parallel requests. The space of possible configurations is large and discrete. It is not possible to list or test all of them. Therefore, we do not fix all arms in advance. Instead, new configurations are created dynamically by genetic operators during candidate generation (line 6). Each configuration acts as a policy that tells the system how to run I/O plans during a scheduling period.
When a retrieval $q_t$ with context $c_t$ arrives, the framework enters the online tuning loop (line 5). For this retrieval, a set of candidate configurations is created through selection, crossover, and mutation (line 6). For every candidate configuration, the surrogate model predicts its reward under the current context (lines 7--9). These predicted rewards are then used to filter and keep only the top promising configurations, or those with high uncertainty (line 10).
When a configuration $\theta$ is used to process a retrieval $q_t$ with context $c_t$, the system observes a random performance result $Y_t=Y\left( \theta ,c_t \right)$. We define the reward as a simple transformation of I/O cost so that a higher reward means better performance. A common form is the negative latency of the retrieval, or the negative I/O time per unit work. Because other retrievals run at the same time, the reward may change even for the same configuration. Thus, many samples are needed to estimate the expected reward.
For the remaining candidates, the framework computes a bandit score using both historical average reward and exploration term (lines 11--13), and then selects the configuration with the highest score (line 14). In this way, the method prefers configurations that have performed well before, but it also tries configurations that have been used only a few times.
The selected configuration is then applied to execute the retrieval (line 15). After execution, the system observes the performance result and converts it into a reward value (line 16). For each configuration $\theta$, the system keeps a memory entry that records how many times it has been used and its average reward. These values are updated after each execution (lines 17--18). This keeps all historical observations instead of discarding older ones, so estimates become more accurate over time, and poor configurations are not repeatedly tried.
The selected configuration may also be added into the population, while poor ones may be removed (line 19). The surrogate model is retrained periodically using data stored in memory (lines 20--22), so that its predictions follow the most recent workload. The tuning step counter is then increased (line 23), and the framework continues with the next retrieval (line 24).
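To make the bandit bookkeeping concrete, the fragment below sketches the score computation, arm selection, and incremental mean update used in the online loop of Algorithm~\ref{alg:sa-gmab}; the surrogate model and genetic operators are abstracted away, and all identifiers are illustrative.
\begin{verbatim}
import math

class ArmMemory:
    """Per-configuration statistics: pull count n, running mean mu."""
    def __init__(self):
        self.n = {}      # theta_id -> number of executions
        self.mu = {}     # theta_id -> running average reward

    def score(self, theta_id, t, alpha):
        """UCB-style score: exploit the mean, explore rare arms."""
        n = self.n.get(theta_id, 0)
        mu = self.mu.get(theta_id, 0.0)
        return mu + alpha * math.sqrt(math.log(t + 1) / (n + 1))

    def update(self, theta_id, reward):
        """Incremental mean update after one observed reward."""
        n = self.n.get(theta_id, 0) + 1
        mu = self.mu.get(theta_id, 0.0)
        self.n[theta_id] = n
        self.mu[theta_id] = mu + (reward - mu) / n

def select_and_update(memory, candidates, t, alpha, run_with_config):
    """Pick the best-scoring candidate, execute, record the reward.

    run_with_config(theta) stands in for executing the retrieval
    under configuration theta and returning, e.g., negative latency.
    """
    theta_t = max(candidates,
                  key=lambda th: memory.score(th, t, alpha))
    reward = run_with_config(theta_t)
    memory.update(theta_t, reward)
    return theta_t, reward
\end{verbatim}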
\section{Performance Evaluation}\label{sec:EXP}
First, we introduce the experimental setup, covering the dataset characteristics, retrieval workload generation, and the distributed cluster environment. Then, we present the experimental results evaluating the proposed I/O-aware indexing structure, the hybrid concurrency-aware I/O coordination mechanism, and the online I/O tuning framework, respectively.
\subsection{Experimental Setup}
\subsubsection{Dataset}
We employed a large-scale, multi-source planetary remote sensing dataset derived from major Martian exploration missions to evaluate the system under heterogeneous workloads. Specifically, the dataset integrates imagery from the Tianwen-1 Medium Resolution Imaging Camera (MoRIC), the Context Camera (CTX), the Thermal Emission Imaging System (THEMIS), and the High Resolution Imaging Science Experiment (HiRISE). These datasets encompass a diverse range of spatial resolutions and spectral bands, covering extensive global Martian surface features. To facilitate efficient window-based I/O and random access, all raw planetary data products were pre-processed and converted into the Cloud-Optimized GeoTIFF (COG) format. The detailed specifications and statistics of the dataset are summarized in Table~\ref{table_dataset}.
% table 1: Dataset
\begin{table}
\renewcommand{\arraystretch}{1.3}
\caption{Dataset Statistics}
\label{table_dataset}
\vspace{-0.13in}
\centering
\begin{tabular}{|m{1.25cm}|m{1.25cm}|m{1.5cm}|m{1.75cm}|m{1cm}|}
\hline
\makecell[c]{\textbf{Dataset}} &\bfseries Resolution & \bfseries Time Span & \bfseries Total Volume & \bfseries Image Count \\
\hline
\hline
\makecell[c]{MoRIC}&\makecell[c]{76m} & \makecell[c]{2021--2022} & \makecell[c]{1.1 TB}&\makecell[c]{12060}\\
\hline
\makecell[c]{CTX}&\makecell[c]{5m} & \makecell[c]{2007--2024} & \makecell[c]{3.1 TB}&\makecell[c]{3960}\\
\hline
\makecell[c]{THEMIS}&\makecell[c]{100m} & \makecell[c]{2003--2024} & \makecell[c]{6.5 TB}&\makecell[c]{669641}\\
\hline
\makecell[c]{HiRISE}&\makecell[c]{0.5m} & \makecell[c]{2007--2024} & \makecell[c]{41.2 TB}&\makecell[c]{96693}\\
\hline
\end{tabular}
\end{table}
\subsubsection{Retrieval Workload}
To evaluate the system performance under diverse scenarios, we developed a synthetic workload generator that simulates concurrent spatio-temporal range retrievals. The retrieval parameters are configured as follows:
\begin{itemize}
\item Spatial Extent: The spatial range of retrievals follows a log-uniform distribution, ranging from small tile-level access ($0.0001\%$ of the scene) to large-scale regional mosaics ($1\%$ to $100\%$ of the scene).
\item Temporal Range: Each retrieval specifies a time interval randomly chosen between 1 day and 1 month.
\item Concurrency \& Contention: The number of concurrent clients $N$ varies from 1 to 64. To test the coordination mechanism, we control the Spatial Overlap Ratio $\sigma \in [0, 0.9]$ to simulate workloads ranging from disjoint access to highly concentrated hotspots.
\end{itemize}
It is worth noting that, given the data-intensive nature of retrievals where a single request triggers GB-scale I/O and complex decoding, 64 concurrent streams are sufficient to fully saturate the aggregate I/O bandwidth and CPU resources of our experimental cluster. With 8 worker nodes connected via 10GbE, a concurrency of 64 implies an average of 8 heavy I/O threads per node. Previous characterization studies on Lustre-based supercomputers \cite{Xie12supercomputer} have revealed that client-side flow control typically limits in-flight RPCs to 8 concurrent requests and that exceeding this parallelism level exacerbates resource contention and straggler effects. Therefore, this setting represents a realistic heavy-load scenario where I/O interference significantly impacts performance.
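
To make the workload parameterization concrete, the sketch below mirrors the generator described above. The way the overlap ratio $\sigma$ is realized (by steering a fraction of queries toward a shared hotspot) and the coordinate and time bounds are simplifying assumptions of this illustration, not the exact implementation.
\begin{verbatim}
# Illustrative sketch of the synthetic workload generator.
import math, random

rng = random.Random(42)

def log_uniform(lo, hi):
    """Draw from a log-uniform distribution over [lo, hi]."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def make_query(sigma, hotspot, t0):
    # Spatial extent: 0.0001% .. 100% of the scene (log-uniform).
    frac = log_uniform(1e-6, 1.0)
    # With probability sigma the query is centred on the shared hotspot,
    # which serves as a simple proxy for the spatial overlap ratio.
    centre = hotspot if rng.random() < sigma else \
        (rng.uniform(-180, 180), rng.uniform(-90, 90))
    # Temporal range: 1 day .. 1 month, placed randomly in the archive.
    length = rng.uniform(86400, 30 * 86400)
    start = t0 + rng.uniform(0, 365 * 86400)
    return {"centre": centre, "area_fraction": frac,
            "t_start": start, "t_end": start + length}

def make_workload(n_clients, sigma, t0=0.0):
    hotspot = (rng.uniform(-180, 180), rng.uniform(-90, 90))
    return [make_query(sigma, hotspot, t0) for _ in range(n_clients)]

workload = make_workload(n_clients=64, sigma=0.8)   # hotspot-heavy load
\end{verbatim}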
\subsubsection{Experimental Environment}
\label{sec_exp_env}
All experiments are conducted on a cluster of 9 homogeneous nodes (1 master node and 8 worker nodes). The cluster is connected via 10 Gbps Ethernet to ensure that network bandwidth is not the primary bottleneck compared to storage I/O. Table \ref{table_config} lists the detailed hardware and software configurations. The I/O-aware index (G2I/I2G) is deployed on HBase, while the raw image data is served by the Lustre parallel file system.
% table 2: Environment
\begin{table}
\renewcommand{\arraystretch}{1.3}
\caption{Cluster Configurations}
\label{table_config}
\vspace{-0.13in}
\centering
\begin{tabular}{|m{2.2cm}|m{5.5cm}|}
\hline
\multicolumn{2}{|c|}{\textbf{Hardware Configuration (Per Node)}} \\
\hline
\makecell[c]{CPU} & Intel Xeon Gold 6248 (20 cores,
2.50GHz)\\
\hline
\makecell[c]{Memory} & \makecell[c]{128GB}\\
\hline
\makecell[c]{Storage} & \makecell[c]{18TB NVMe SSD (Data) + 500GB SSD (OS)}\\
\hline
\makecell[c]{Network} & \makecell[c]{10 Gigabit Ethernet (10GbE)}\\
\hline\hline
\multicolumn{2}{|c|}{\textbf{Software Stack}} \\
\hline
\makecell[c]{OS} & \makecell[c]{Ubuntu 22.04 LTS} \\
\hline
\makecell[c]{Storage} & \makecell[c]{Hadoop 3.2.1, HBase 2.4.5, Lustre}\\
\hline
\makecell[c]{Runtime} & \makecell[c]{OpenJDK 11}\\
\hline
\end{tabular}
\end{table}
\subsection{Evaluating the Data Indexing Structure}
To comprehensively evaluate the effectiveness of the proposed I/O-aware indexing structure, we conducted experiments on a single cluster node, as each node independently performs indexing for spatial retrieval in the distributed setting. We compare our approach against five representative baseline systems that span traditional database indexes, distributed NoSQL-based schemes, and state-of-the-art windowed I/O frameworks.
The comparative methods are categorized as follows:
\begin{enumerate}
\item \textbf{PostGIS (Full-file Retrieval):} A traditional relational database approach that employs R-tree spatial indexes for metadata filtering. While it efficiently identifies candidate images through spatial intersection tests, it retrieves entire image files during data extraction, incurring substantial I/O overhead even for small spatial queries.
\item \textbf{GeoMesa (Full-file Retrieval):} A distributed spatio-temporal index built on HBase, which encodes spatial footprints using Z-order space-filling curves for scalable metadata discovery. Despite its strong indexing performance on billion-scale datasets, it still relies on full-file data loading.
\item \textbf{MSTGI (Full-file Retrieval):} A recently proposed multi-scale spatio-temporal grid index model \cite{liu24mstgi} that enhances GeoMesa through hierarchical time granularity (year/month/day) and Hilbert curve-based linearization. It inherits the full-file retrieval limitation, where data extraction cost remains decoupled from index-level optimization.
\item \textbf{OpenDataCube (Window-based I/O):} A state-of-the-art data cube system that couples PostGIS indexes with windowed I/O via rasterio, enabling partial reads from monolithic image files. By leveraging GeoBox-based ROI computation and automatic overview selection, OpenDataCube represents the theoretical optimum for I/O selectivity but incurs runtime geospatial computation overhead to resolve pixel-to-geographic mappings.
\item \textbf{Rio-tiler (Window-based I/O):} A lightweight raster reading engine optimized for dynamic tile generation. Similar to OpenDataCube, it employs PostGIS for spatial indexing and windowed I/O for partial data access, but features a streamlined execution path with minimal abstraction layers, resulting in lower per-query overhead. Rio-tiler serves as a high-performance baseline for windowed reading without the complexity of full data cube management.
\item \textbf{Ours (I/O-aware Indexing):} The proposed approach leverages a dual-layer inverted index structure comprising Grid-to-Image (G2I) and Image-to-Grid (I2G) mappings. By pre-materializing grid-to-pixel correspondences at ingestion time, our method translates spatio-temporal predicates directly into byte-level read plans, completely eliminating runtime geometric computations while preserving minimal I/O volume through precise windowed access (see the illustrative sketch after this list).
\end{enumerate}
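
To make the contrast with the window-based baselines concrete, the sketch below illustrates the two-phase lookup used by our method: the G2I table prunes the candidate images per grid cell, and the I2G table returns pre-materialized pixel windows that can be issued directly as windowed reads. The key layouts, the in-memory dictionaries, and the metadata fields are simplified assumptions; the actual tables reside in HBase and use a different schema.
\begin{verbatim}
# Simplified in-memory stand-in for the dual-layer inverted index.
# Key layouts and metadata fields are assumed for exposition only.

# G2I: grid cell id -> image ids whose footprint intersects the cell.
G2I = {"z14_8192_5461": ["img_001", "img_007"],
       "z14_8193_5461": ["img_007"]}

# I2G: (image id, grid cell id) -> pre-materialized pixel window
#      (col_off, row_off, width, height) computed at ingestion time.
I2G = {("img_001", "z14_8192_5461"): (0,    0,   512, 512),
       ("img_007", "z14_8192_5461"): (1024, 512, 512, 512),
       ("img_007", "z14_8193_5461"): (1536, 512, 512, 512)}

def build_read_plan(query_cells, t_start, t_end, image_meta):
    """Translate a spatio-temporal predicate into window-level read plans
    without any runtime geometric computation."""
    plans = []
    for cell in query_cells:                      # phase 1: G2I pruning
        for img in G2I.get(cell, []):
            acq = image_meta[img]["acq_time"]
            if not (t_start <= acq <= t_end):     # temporal filter
                continue
            plans.append({"image": img,           # phase 2: I2G window
                          "window": I2G[(img, cell)]})
    return plans

meta = {"img_001": {"acq_time": 100}, "img_007": {"acq_time": 200}}
print(build_read_plan(["z14_8192_5461"], 0, 150, meta))
# -> [{'image': 'img_001', 'window': (0, 0, 512, 512)}]
\end{verbatim}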
\subsubsection{I/O Selectivity Analysis}\label{sec:Index_exp_1}
\begin{figure}[tb]
\centering
\subfigure[Query footprint ratios]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.95\textwidth]{exp/index_exp1_1.pdf}
\end{minipage}
}
\label{fig:index_exp1_1}
\subfigure[Query spatial extents]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.95\textwidth]{exp/index_exp1_2.pdf}
\end{minipage}
}
\label{fig:index_exp1_2}
\caption{The efficiency of I/O selectivity}
\label{fig:index_exp1}
\end{figure}
First, we evaluated the effectiveness of data reduction by measuring the I/O selectivity, defined as the ratio of the retrieved data volume to the total file size. Fig.~\ref{fig:index_exp1} compares our method against the baselines. As illustrated in Fig.~\ref{fig:index_exp1}(a), systems that rely on full-file loading (PostGIS, GeoMesa, and MSTGI) exhibit a constant, maximal I/O footprint: they always read the entire image regardless of how much of it intersects the query range. In contrast, OpenDataCube, Rio-tiler, and our method significantly reduce I/O traffic by enabling partial reads. It is worth noting that our method incurs slightly higher I/O volume than the theoretically optimal baselines (OpenDataCube and Rio-tiler). This marginal data redundancy is attributed to the grid alignment effect: our index retrieves pixel blocks along fixed grid boundaries, whereas OpenDataCube and Rio-tiler perform precise geospatial clipping. Fig.~\ref{fig:index_exp1}(b) further presents the distribution of the unnecessary data fraction. While our method introduces a small amount of over-reading due to grid padding, it avoids the massive data waste observed in the full-file retrieval systems. As we demonstrate in the next section, this slight compromise in I/O precision is a strategic trade-off that eliminates expensive runtime computations.
\subsubsection{End-to-End Retrieval Latency}\label{sec:Index_exp_2}
\begin{figure}[tb]
\centering
\subfigure[Mean latency]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.95\textwidth]{exp/index_exp2_1.pdf}
\end{minipage}
}
\label{fig:index_exp2_1}
\subfigure[p95 latency]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.95\textwidth]{exp/index_exp2_2.pdf}
\end{minipage}
}
\label{fig:index_exp2_2}
\caption{End-to-End retrieval latency}
\label{fig:index_exp2}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/index_exp2_3.pdf}
\caption{Latency breakdown}
\label{fig:index_exp2_3}
\end{figure}
We next measured the end-to-end retrieval latency to verify whether I/O reduction effectively translates into time efficiency across different indexing and retrieval strategies. Fig.~\ref{fig:index_exp2}(a) reports the mean latency, and Fig.~\ref{fig:index_exp2}(b) shows the 95th percentile (p95) across varying retrieval footprint ratios. The results reveal three distinct performance categories, each corresponding to a fundamental architectural design choice.
As shown in Fig.~\ref{fig:index_exp2}(a), all three full-file baselines (PostGIS, GeoMesa, MSTGI) exhibit high and flat latency curves ranging from approximately 4,480 to 4,850 ms, nearly independent of the query footprint ratio. This performance ceiling is imposed by the necessity of transferring complete image files regardless of the spatial extent requested. Among these methods, PostGIS achieves the lowest latency ($\approx 4,500$ ms) due to the efficiency of R-tree traversal for metadata filtering. GeoMesa incurs slightly higher latency ($\approx 4,700$--4,850 ms) as a result of distributed query coordination overhead in its HBase-backed architecture. MSTGI falls between the two ($\approx 4,600$--4,750 ms), reflecting its multi-scale indexing optimization that partially compensates for the underlying GeoMesa framework. Critically, all three methods are dominated by massive I/O transfer that entirely masks computational overheads, rendering index-level optimizations ineffective for end-to-end latency.

In contrast, both OpenDataCube and Rio-tiler show a strong correlation between retrieval latency and query footprint, confirming that partial reading successfully eliminates unnecessary data transfer. For small tile-level retrievals (footprint ratio $\le 10^{-3}$), Rio-tiler achieves latencies of 34--41 ms, while OpenDataCube ranges from 43 to 52 ms. This 15--20\% advantage of Rio-tiler stems from its streamlined execution path with minimal abstraction layers. However, as the query footprint grows beyond $10^{-2}$, both methods exhibit linear latency growth, reaching 5,090 ms (Rio-tiler) and 5,270 ms (OpenDataCube) for full-image queries. The floor latency of approximately 380 ms for small retrievals indicates a fixed computational cost component that is not attributable to I/O.

Our method achieves the lowest latency across all query scales. For typical tile-level retrievals (footprint ratio $10^{-4}$ to $10^{-2}$), latency ranges from 32 to 114 ms, representing a 3.8--12$\times$ speedup over Rio-tiler and a 4.2--13$\times$ improvement over OpenDataCube. Crucially, our method preserves near-constant latency for small-to-medium queries and only exhibits noticeable growth once the footprint ratio exceeds $0.1$. At the extreme of full-image queries, our approach reaches 4,525 ms, comparable to the full-file methods, as the grid-alignment overhead becomes negligible relative to complete data transfer.
To empirically validate the cost model in Eq.~\ref{eqn:cost_total}, we decomposed the retrieval latency into three components: metadata lookup ($C_{meta}$), geospatial computation ($C_{geo}$), and I/O access ($C_{io}$). Fig.~\ref{fig:index_exp2_3} presents the breakdown for a representative medium-scale retrieval involving approximately 50 image tiles.

PostGIS, GeoMesa, and MSTGI all exhibit $C_{io}$-dominated profiles, with I/O access consuming approximately 1,200 ms and accounting for over 98\% of total latency. PostGIS maintains the lowest $C_{meta}$ (20 ms) due to its single-node R-tree structure, GeoMesa incurs a higher $C_{meta}$ (120 ms) from distributed metadata coordination, and MSTGI achieves an intermediate $C_{meta}$ (70 ms) by optimizing the query path through its hierarchical time-granularity design. All three methods maintain $C_{geo} \approx 0$ since no geometric computation is performed; entire images are returned directly. Both OpenDataCube and Rio-tiler successfully reduce $C_{io}$ to approximately 550--580 ms by reading only the intersecting windows. However, this I/O advantage is partially offset by a substantial $C_{geo}$ overhead (350 ms, or 38--40\% of total latency) incurred by runtime coordinate transformations and precise clipping computations. Their nearly identical $C_{meta}$ (20 ms) reflects a shared reliance on PostGIS for spatial indexing.

Our method achieves a balanced profile with $C_{meta} = 35$ ms (slightly higher than the windowed baselines due to the two-phase G2I+I2G lookup) and $C_{io} = 600$ ms (comparable to the windowed I/O methods). Critically, $C_{geo}$ is completely eliminated by the pre-materialized grid-to-pixel mappings, resulting in a total latency of 635 ms. This represents a 1.7$\times$ speedup over OpenDataCube (1,070 ms) and a 1.6$\times$ improvement over Rio-tiler (970 ms). The decomposition validates our core thesis: by shifting the computational burden from retrieval time to ingestion time, I/O-aware indexing achieves near-optimal I/O cost while avoiding the computational bottleneck that plagues existing windowed I/O systems.
\subsubsection{Ablation Study}\label{sec:Index_exp_3}
\begin{figure}[tb]
\centering
\subfigure[I/O reduction analysis]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.9\textwidth]{exp/index_exp3_1.pdf}
\end{minipage}
}
\label{fig:index_exp3_1}
\subfigure[Latency breakdown]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.9\textwidth]{exp/index_exp3_2.pdf}
\end{minipage}
}
\label{fig:index_exp3_2}
\caption{Ablation analysis}
\label{fig:index_exp3}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/index_exp3_3.pdf}
\caption{Impact of grid resolution on query latency}
\label{fig:index_exp3_3}
\end{figure}
To quantify the individual contributions of the G2I (coarse filtering) and I2G (fine-grained access) components, we decomposed the system into four variants. Fig.~\ref{fig:index_exp3} breaks down the performance in terms of I/O volume and latency components. Fig.~\ref{fig:index_exp3}(a) confirms that removing either component leads to suboptimal I/O behavior. The \textit{No Index} and \textit{G2I Only} variants result in 100\% I/O volume, as they lack the window information required for partial access. Conversely, \textit{I2G Only} and \textit{G2I+I2G} achieve minimal I/O volume ($\approx 10\%$).
Fig.~\ref{fig:index_exp3}(b) reveals the latency breakdown. \textit{No Index} suffers from both high full table scanning cost and high storage I/O cost. \textit{G2I Only} efficiently reduces metadata lookup time ($\approx 50$ ms) but fails to reduce storage I/O ($\approx 8000$ ms). Although \textit{I2G Only} minimizes storage I/O ($\approx 100$ ms), it incurs prohibitive metadata lookup overhead ($\approx 1500$ ms) because the system must scan the entire I2G table to identify relevant images without spatial pruning. \textit{G2I+I2G} achieves the best performance, maintaining low metadata latency ($\approx 60$ ms) via G2I pruning while ensuring minimal storage I/O ($\approx 100$ ms) via I2G windowing.
Moreover, the grid resolution (zoom level) is a critical parameter that dictates the trade-off between metadata management overhead and I/O precision. To justify our selection of Zoom Level 14, we conducted a sensitivity analysis by varying the grid resolution from Level 12 to Level 16 under a fixed workload of medium-scale range queries. Fig.~\ref{fig:index_exp3_3} illustrates the latency breakdown across resolutions. The results reveal a clearly convex total-latency curve driven by two opposing forces. For coarse-grained grids (Level $\le 13$), metadata lookup is extremely fast ($C_{meta} < 30$ ms) because few grid keys are involved, but the I/O cost is prohibitively high: large grid cells force the system to read significant amounts of irrelevant pixel data outside the actual query boundary, which becomes the dominant bottleneck. Conversely, finer grids (Levels 15 and 16) maximize I/O precision and reduce $C_{io}$ to its theoretical minimum, but at the expense of increased metadata volume: a single query may intersect thousands of Level-16 micro-grids, causing $C_{meta}$ to surge ($>100$ ms) due to the overhead of scanning and processing massive key lists in the G2I/I2G tables. As evidenced by the trough in the total latency curve, Zoom Level 14 is the sweet spot for our dataset: at this resolution, the grid cell size roughly matches the typical internal tile size of remote sensing images, keeping I/O waste low while maintaining a manageable number of index keys. Consequently, our system adopts Level 14 as the default global configuration.
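
As a rough illustration of this trade-off, consider a quadtree-style grid in which each additional zoom level splits every cell into four children; the expressions below are for exposition only and are not fitted to our measurements:
\begin{equation*}
\left| \mathcal{G}_{L+\Delta} \right| \;\approx\; 4^{\Delta}\left| \mathcal{G}_{L} \right| ,
\qquad
W_{L} \;\approx\; \frac{P\, s_{L} + s_{L}^{2}}{A},
\end{equation*}
where $\left| \mathcal{G}_{L} \right|$ is the number of grid cells intersected by a fixed query window at level $L$, $s_{L}$ is the cell edge length (halved at every level), $P$ and $A$ are the perimeter and area of the query window, and $W_{L}$ approximates the fraction of over-read data contributed by boundary cells. $C_{meta}$ grows with $\left| \mathcal{G}_{L} \right|$, while $W_{L}$ (and hence the wasted part of $C_{io}$) shrinks as $s_{L}$ decreases, which explains the convex total-latency curve in Fig.~\ref{fig:index_exp3_3}.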
\subsubsection{Index Construction and Storage Overhead}
\begin{figure}[tb]
\centering
\subfigure[Ingested images ($10^4$)]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.98\textwidth]{exp/index_exp4_2.pdf}
\end{minipage}
}
\label{fig:index_exp4_2}
\subfigure[Various index types]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.91\textwidth]{exp/index_exp4_1.pdf}
\end{minipage}
}
\label{fig:index_exp4_1}
\caption{Index construction and storage overhead}
\label{fig:index_exp4}
\end{figure}
Finally, we evaluated the scalability and cost of maintaining the index. Fig.~\ref{fig:index_exp4} compares our method against PostGIS (R-tree), GeoMesa (Z-order), and MSTGI during the ingestion of $7\times 10^5$ images. Fig.~\ref{fig:index_exp4}(a) illustrates the ingestion throughput. PostGIS exhibits a degrading trend as the dataset grows, bottlenecked by the logarithmic cost of R-tree rebalancing. In contrast, GeoMesa maintains a stable and high throughput ($\approx 2500$ img/s) owing to its append-only write pattern in HBase. MSTGI achieves a throughput of $\approx 2350$ img/s, slightly lower than GeoMesa due to the additional overhead of maintaining multi-scale temporal indexes. Our method demonstrates a stable throughput of $\approx 2100$ img/s, lower than both GeoMesa and MSTGI because of the dual-table write overhead (G2I + I2G), yet it still exhibits linear scalability suitable for high-velocity streaming data. Regarding storage cost (Fig.~\ref{fig:index_exp4}(b)), our index occupies approximately 0.83\% of the raw data size. GeoMesa maintains the lowest storage footprint (0.15\%) by encoding only spatial footprints via Z-order curves, while MSTGI incurs a moderately higher storage cost because it must maintain multi-scale temporal partitions alongside spatial indexes. Although our method's storage overhead exceeds that of both GeoMesa and PostGIS (0.51\%) due to the pre-materialized grid-window mappings, it remains strictly below the 1\% threshold. This result confirms that the proposed method achieves significant performance gains with a negligible storage penalty.
\subsection{Evaluating the Concurrency Control}
In this section, we evaluate the proposed hybrid coordination mechanism on a distributed storage cluster to assess its scalability, robustness under contention, and internal storage efficiency.
To systematically control the workload characteristics, we developed a synthetic workload generator. We define the Spatial Overlap Ratio ($\sigma$) to quantify the extent of shared data regions among concurrent queries, ranging from $\sigma=0$ (disjoint) to $\sigma=0.9$ (highly concentrated hotspots). The number of concurrent clients varies from $N=1$ to $N=64$.
For comparison, we evaluate the following execution schemes:
\begin{enumerate}
\item \textbf{Baseline (Shared Index):} Metadata access is shared, but data retrieval remains uncoordinated, representative of state-of-the-art systems such as OpenDataCube.
\item \textbf{Ours:} The proposed mechanism featuring contention-aware switching, global I/O plan ordering, and window merging.
\end{enumerate}
\begin{figure*}[htb]
\centering
\subfigure[$\sigma=0.4$]{\label{fig:cc_exp1_3}
\includegraphics[width=2.1in]{exp/cc_exp1_3.pdf}}
\subfigure[$\sigma=0.6$]{\label{fig:cc_exp1_2}
\includegraphics[width=2.1in]{exp/cc_exp1_2.pdf}}
\subfigure[$\sigma=0.8$]{\label{fig:cc_exp1_1}
\includegraphics[width=2.1in]{exp/cc_exp1_1.pdf}}
\caption{Concurrency scalability analysis under varying spatial overlap ratios ($\sigma$).}
\label{fig:cc_exp1}
\end{figure*}
\subsubsection{Concurrency Scalability}
To evaluate the system's robustness under different workload characteristics, we conducted a sensitivity analysis by varying the Spatial Overlap Ratio ($\sigma$). We examined three distinct scenarios: low overlap ($\sigma=0.4$, simulating dispersed random queries), medium overlap ($\sigma=0.6$), and high overlap ($\sigma=0.8$, simulating hotspot analysis). Note that $\sigma=0.4$ is treated as the low-overlap scenario because, when $\sigma \le 0.35$, the deterministic scheduling mode performs worse than the optimistic mode (see Sec.~\ref{sec:ModeSwitch}), so in that regime our method behaves identically to the Baseline. Fig.~\ref{fig:cc_exp1} illustrates the query latency trends as the number of concurrent clients increases from 1 to 64.
The results reveal a fundamental divergence in how the two systems respond to data contention. As shown in Fig.~\ref{fig:cc_exp1}(a), when query footprints are spatially dispersed, the opportunities for physical I/O merging are minimal. Consequently, the performance of both systems is primarily constrained by the aggregate physical bandwidth of the storage cluster. Both approaches exhibit linear latency growth with respect to the client count. At $N=64$, the Baseline reaches a mean latency of approx. 37,000 ms, while our method records approx. 30,000 ms. Although our method maintains a slight performance edge due to the reduced read amplification provided by the I/O-aware index, it inevitably degrades to a linear query processing mode similar to the Baseline. This confirms that without spatial locality to leverage request collapsing, the system is bound by the hardware's I/O throughput limits.
A sharp performance divergence is observed as the overlap ratio increases to $\sigma=0.8$ (Fig.~\ref{fig:cc_exp1}(c)). The Baseline suffers from severe performance degradation, with latency spiking from 37,000 ms (at $\sigma=0.4$) to 60,000 ms (at $\sigma=0.8$) under peak load. This deterioration is attributed to the I/O blender effect and lock convoys: highly concurrent requests competing for the same index pages and storage blocks cause excessive I/O thrashing and thread blocking, significantly reducing effective throughput. Conversely, our method demonstrates sub-linear scalability in this scenario. The latency at $N=64$ drops to 1,100 ms, a $54\times$ speedup over the Baseline. This result validates the efficacy of the \textit{Request Collapse} mechanism. As $\sigma$ increases, the probability of multiple logical queries targeting the same physical byte ranges rises, allowing the scheduler to merge $N$ concurrent requests into a single physical I/O operation.
The medium-overlap scenario (Fig.~\ref{fig:cc_exp1}(b)) serves as a transition point, where our method achieves a mean latency of approximately 6,000 ms at peak load, compared to 40,000 ms for the Baseline. This indicates that the system's efficiency scales dynamically with the degree of data contention. Overall, these results demonstrate the workload-adaptive nature of the proposed architecture: while the system performs comparably to traditional approaches under dispersed workloads (limited by physical bandwidth), its advantage grows to an order of magnitude in data-intensive, high-contention scenarios, effectively turning I/O contention into an optimization opportunity.
\subsubsection{Storage-Level Effects and Request Collapse}
\begin{figure}[tb]
\centering
\subfigure[Physical data volume]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.94\textwidth]{exp/cc_exp3_1.pdf}
\end{minipage}
}
\label{fig:cc_exp3_1}
\subfigure[Backend I/O requests]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.97\textwidth]{exp/cc_exp3_2.pdf}
\end{minipage}
}
\label{fig:cc_exp3_2}
\caption{The data volume reduction and request collapse}
\label{fig:cc_exp3}
\end{figure}
To uncover the cause of the significant latency reduction observed in high-contention scenarios ($\sigma=0.8$), we further analyzed the internal I/O behavior of the system. Specifically, we measured the total volume of physical data transferred from disk and the number of backend I/O requests issued to the storage system. Fig.~\ref{fig:cc_exp3} compares the physical storage pressure between the shared index baseline and our method.
Fig.~\ref{fig:cc_exp3}(a) plots the total physical data read size. The baseline exhibits a strict linear increase in data volume. At $N=64$, the system is forced to fetch 32 GB of data. This confirms that without coordination, logically overlapping queries translate into redundant physical reads, leading to severe bandwidth saturation. In contrast, our approach effectively decouples logical demand from physical execution. Although 64 clients logically request 32 GB of data, the request collapse mechanism merges these overlapping windows, resulting in only 5 GB of actual disk traffic. This 84\% reduction in data volume explains why our system avoids the bandwidth bottleneck.
Fig.~\ref{fig:cc_exp3}(b) illustrates the number of backend I/O requests. The baseline issues distinct I/O requests for every client, reaching 12,800 requests at $N=64$. This massive influx of small random reads overwhelms the storage scheduler and inflates per-request queuing and access latency. Our system keeps the request count low and stable: even at peak load ($N=64$), the number of physical I/O requests is suppressed to roughly 2,000.
Taken together, Fig.~\ref{fig:cc_exp3}(a) and Fig.~\ref{fig:cc_exp3}(b) demonstrate the Request Collapse effect: the 12,800 uncoordinated requests issued by 64 concurrent clients in the baseline are merged into roughly 2,000 physical operations, reducing the backend request count by more than $6\times$.
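
The data-volume and request-count reductions above both stem from merging overlapping or adjacent byte ranges before they reach the storage layer. The sketch below shows this interval-merging idea in its simplest form; the merge-gap threshold and the request representation are simplifying assumptions, and the real coordinator additionally fans the merged result back out to the waiting clients.
\begin{verbatim}
# Minimal sketch of request collapse: merge overlapping or adjacent
# byte ranges of the same file into one physical read.
MERGE_GAP = 64 * 1024          # assumed merge threshold (64 KiB)

def collapse(requests):
    """requests: list of (file, offset, length, client_id).
    Returns physical reads, each serving one or more logical requests."""
    by_file, physical = {}, []
    for f, off, length, cid in requests:
        by_file.setdefault(f, []).append((off, length, cid))
    for f, reqs in by_file.items():
        reqs.sort()
        start, end, clients = None, None, []
        for off, length, cid in reqs:
            if start is None or off > end + MERGE_GAP:
                if start is not None:
                    physical.append((f, start, end - start, clients))
                start, end, clients = off, off + length, [cid]
            else:                      # overlaps or nearly touches: merge
                end = max(end, off + length)
                clients.append(cid)
        if start is not None:
            physical.append((f, start, end - start, clients))
    return physical

reqs = [("scene.tif", 0, 4096, "c1"), ("scene.tif", 2048, 4096, "c2"),
        ("scene.tif", 500000, 1024, "c3")]
print(collapse(reqs))          # two physical reads serve three requests
\end{verbatim}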
\subsubsection{Deterministic and Non-Deterministic Modes}\label{sec:ModeSwitch}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/cc_exp4.pdf}
\caption{Mode Switching}
\label{fig:cc_exp4}
\end{figure}
To validate the necessity of the proposed hybrid switching mechanism, we compared our adaptive approach against two static execution policies: non-deterministic mode and deterministic mode. We swept the spatial overlap ratio from 0 to 0.9 to capture the performance crossover zone. The workload concurrency was fixed at a moderate level ($N=16$) to highlight the sensitivity to spatial contention.
Fig.~\ref{fig:cc_exp4} depicts the average latency curves for the three strategies. The results demonstrate that neither static policy is universally optimal, validating the design of our contention-aware switch. In the low-overlap condition, the non-deterministic mode achieves the lower latency. In contrast, the deterministic mode incurs a higher baseline latency. This performance gap is attributed to the inherent coordination overhead. Even when no conflicts exist, the deterministic scheduler must still perform time-window batching and plan sorting. Consequently, enforcing global ordering for disjoint queries introduces unnecessary serialization latency.
As $\sigma$ increases, the non-deterministic approach suffers from rapid performance deterioration, exhibiting an exponential growth trend. At $\sigma=0.4$, its latency spikes to 146 ms. This degradation is caused by severe lock and I/O contention at the storage layer, where uncoordinated threads compete for the same disk blocks. Conversely, the deterministic mode exhibits stable, linear scalability thanks to its request merging capability, maintaining a latency of 110 ms at $\sigma=0.4$.
Our hybrid approach successfully combines the benefits of both worlds. As shown in Fig.~\ref{fig:cc_exp4}, the hybrid curve (blue line) tightly tracks the lower performance envelope of the two static policies. The system identifies the intersection point at $\sigma \approx 0.34$. When contention is below this threshold, it operates optimistically to minimize latency. Once contention exceeds it, the system automatically engages the deterministic scheduler to prevent thrashing.
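
The switching behaviour in Fig.~\ref{fig:cc_exp4} can be summarized by a simple contention-aware rule, sketched below. The threshold value is taken from the empirically observed crossover point, while the windowed overlap estimator and its data structures are illustrative simplifications of the actual controller.
\begin{verbatim}
# Illustrative contention-aware mode switch (simplified).
SIGMA_THRESHOLD = 0.34   # empirically observed crossover point

def estimate_overlap(recent_plans):
    """Approximate the spatial overlap ratio of the current scheduling
    window as the fraction of requested grid cells touched by >= 2 plans."""
    counts = {}
    for plan in recent_plans:
        for cell in plan["cells"]:
            counts[cell] = counts.get(cell, 0) + 1
    shared = sum(1 for c in counts.values() if c >= 2)
    return shared / max(1, len(counts))

def choose_mode(recent_plans):
    sigma = estimate_overlap(recent_plans)
    if sigma >= SIGMA_THRESHOLD:
        return "deterministic"   # batch, globally order, and merge plans
    return "optimistic"          # execute immediately, no global ordering

plans = [{"cells": ["a", "b", "c"]}, {"cells": ["b", "c", "d"]}]
print(choose_mode(plans))        # overlap 0.5 -> "deterministic"
\end{verbatim}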
\subsection{Evaluating the I/O Tuning}
In this section, we evaluate the effectiveness of the proposed SA-GMAB tuning framework. The experiments are designed to verify four key properties: fast convergence speed, robustness against stochastic noise, adaptability to workload shifts, and tangible end-to-end performance gains.
To comprehensively assess SA-GMAB across different optimization paradigms, we benchmark against five representative tuning strategies spanning heuristic search, probabilistic modeling, simulation-based prediction, and reinforcement learning approaches:
\begin{enumerate}
\item \textbf{Genetic Algorithm (GA):} A canonical evolutionary search method that explores the configuration space through selection, crossover, and mutation operators \cite{Behzad13HDF5}. GA serves as the foundational algorithm in TunIO and represents the baseline heuristic approach.
\item \textbf{Simulated Annealing (SA):} A classical stochastic optimization technique inspired by metallurgical annealing \cite{Chen98SA, Robert20SA}. SA has been widely applied in HPC I/O tuning for over two decades and provides a mature baseline for convergence analysis.
\item \textbf{Bayesian Optimization with TPE:} A model-based sequential optimization method that constructs a surrogate using Tree-structured Parzen Estimators and selects candidates via Expected Improvement \cite{Agarwal19TPE}. TPE represents state-of-the-art probabilistic optimization and achieves rapid convergence in recent HPC I/O studies.
\item \textbf{Random Forest Regression (RF):} A simulation-based approach that trains an ensemble predictor on historical execution logs to rank candidate configurations offline \cite{Bagbaba20RF}. RF drastically reduces tuning time from hours to seconds by avoiding repeated real-system evaluations.
\item \textbf{TunIO:} A recent framework integrating high-impact parameter selection with Reinforcement Learning-driven early stopping \cite{Rajesh24TunIO}. TunIO balances tuning cost and performance in complex HPC I/O stacks and represents the state-of-the-art RL-based approach.
\item \textbf{SA-GMAB (Ours):} The proposed framework combining surrogate modeling with a Genetic Multi-Armed Bandit strategy, explicitly designed to accelerate convergence and handle stochastic performance fluctuations in concurrent workloads.
\end{enumerate}
\subsubsection{Convergence Speed and Tuning Cost}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/tune_exp1_1.pdf}
\caption{Efficiency analysis of the tuning framework.}
\label{fig:tune_exp1}
\end{figure}
We conduct a cold-start tuning experiment to evaluate how efficiently each method identifies high-performance I/O configurations from an unoptimized initial state. All methods start from the same default configuration with an initial latency of 834 ms. Each tuning step corresponds to evaluating one candidate configuration on the actual system, and we record the best-observed latency over 100 steps.
\par
Corresponding to the convergence trajectories in Fig.~\ref{fig:tune_exp1}, the six methods exhibit distinct convergence patterns that fall into three groups: classical heuristics (SA, GA), model-based methods (RF, BO-TPE), and learning-based methods (TunIO and ours). SA performs the worst: its latency initially surges to 1,009 ms at step~3 before gradually declining to 536 ms, and its non-monotonic acceptance of worse configurations proves detrimental in expensive I/O tuning scenarios. GA demonstrates steady but slow improvement, following a characteristic staircase-like descent with prolonged plateaus; it requires over 100 steps to reach 394 ms, as the mutation operator repeatedly explores ineffective regions and yields low information gain per evaluation. RF achieves a rapid initial descent, dropping to approximately 480 ms within the first 10 steps and eventually reaching 336 ms. By constructing a surrogate model from historical execution data, RF can rank candidates without direct system evaluation; however, the plateau observed after step~15 suggests that the surrogate's predictive accuracy becomes the bottleneck, since the model cannot extrapolate beyond its training distribution. BO-TPE performs best among the model-based methods, converging to 310 ms by step~100 by maintaining a probabilistic surrogate and selecting candidates via expected improvement.
The RL-based TunIO outperforms the above baselines but still suffers from a slow start. While it eventually reaches a competitive latency ($\approx 266$ ms at step~71), its exploration phase is costly: the RL agent requires a substantial number of interaction samples to learn the complex mapping between I/O parameters and reward signals. Our method achieves the fastest latency drop, rapidly descending from the initial latency to a near-optimal zone ($\approx 277$ ms) within a short time. SA-GMAB leverages the surrogate model to pre-screen candidates, and its permanent memory mechanism enables more efficient candidate pruning, making it particularly suitable for online scenarios where tuning overhead must be minimized.
\subsubsection{Adaptation to Workload Shifts}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/tune_exp3_1.pdf}
\caption{Adaptation to workload shifts}
\label{fig:tune_exp3}
\end{figure}
We further investigated the system's resilience in non-stationary environments. We introduced a sudden workload shift at tuning step $t=60$, drastically changing the query pattern from sparse random access to dense sequential scans so as to invalidate the previously learned optimal parameters.
Fig.~\ref{fig:tune_exp3} illustrates the latency evolution before and after the shift. At $t=60$, the workload transition causes an immediate performance collapse across all methods, with latency spiking from a stable $\approx 50$ ms to over 300 ms. This confirms that the configuration optimal for the previous phase is detrimental in the new environment. The GA-based method fails to adapt effectively: post-shift, its latency hovers around 290--300 ms, and lacking a mechanism to quickly reset or redirect exploration, the genetic algorithm remains trapped in the local optima of the previous workload, exhibiting almost no recovery within the observation window. TunIO manages to reduce latency, but slowly: it takes 40 steps to lower the latency from 308 ms to 134 ms ($t=100$). While the RL agent eventually learns the new reward function, its high sample complexity delays the recovery, leaving the system in a suboptimal state for a prolonged period. In contrast, SA-GMAB executes a decisive recovery. By leveraging the surrogate model to prioritize high-uncertainty candidates, it rapidly identifies the new optimal region: the latency drops to $\approx 88$ ms at $t=80$ and further stabilizes at $\approx 74$ ms at $t=100$.
\section{Conclusions}\label{sec:Con}
This paper presents an I/O-aware retrieval approach designed to bound retrieval latency and maximize throughput for large-scale spatio-temporal analytics. By introducing the ``Index-as-an-Execution-Plan" paradigm, the dual-layer inverted index bridges the semantic gap between logical indexing and physical storage, effectively shifting the computational burden from retrieval time to ingestion time. To address the scalability challenges in concurrent environments, we developed a hybrid concurrency-aware I/O coordination protocol that adaptively switches between deterministic ordering and optimistic execution based on spatial contention. Furthermore, to handle the complexity of parameter configuration in fluctuating workloads, we integrated the SA-GMAB method for online automatic I/O tuning. The experimental results indicate that: (1) I/O-aware indexing achieves an order-of-magnitude latency reduction with negligible storage overhead; (2) the hybrid coordination protocol realizes a $54\times$ throughput improvement in high-overlap scenarios; and (3) the SA-GMAB method recovers from workload shifts $2\times$ faster than RL baselines while maximizing RoTI.
\section*{Acknowledgments}
This work is supported by the National Key R\&D Program of China ``Intergovernmental International Science and Technology Innovation Cooperation" (Grant No.2025YFE0107100).
\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
\bibliography{IEEEabrv,references}
%
\vfill
\end{document}