Add baselines and figures for the index experiments

2026-02-03 21:06:13 +08:00
parent d64f885e16
commit c063a5599d
16 changed files with 338 additions and 276 deletions


@@ -81,11 +81,12 @@ To address the aforementioned problems, we propose a novel ``Index-as-an-Executi
The remainder of this paper is organized as follows:
Section~\ref{sec:RW} presents the related work.
-Section~\ref{sec:DF} proposes the definition concerning the spatio-temporal range retrieval problem.
-Section~\ref{sec:Index} proposes the indexing structure.
-Section~\ref{sec:CC} proposes the hybrid concurrency control protocol.
-Section~\ref{sec:Tuning} proposes the method of I/O stack tuning.
-Section~\ref{sec:EXP} presents the experiments and results.
+Section~\ref{sec:DF} formulates the spatio-temporal range retrieval problem and establishes the cost models.
+Section~\ref{sec:Overview} provides an overview of the proposed framework and describes how the three modules are integrated.
+Section~\ref{sec:Index} presents the I/O-aware indexing structure.
+Section~\ref{sec:CC} proposes the hybrid concurrency-aware I/O coordination protocol.
+Section~\ref{sec:Tuning} presents the GMAB-based online I/O stack tuning method.
+Section~\ref{sec:EXP} presents the experiments and results.
Section~\ref{sec:Con} concludes this paper with a summary.
\section{Related Work}\label{sec:RW}
@@ -145,7 +146,7 @@ Each retrieval $Q_i$ independently specifies a spatio-temporal window $\langle S
\vspace{-0.05in}
\begin{equation}
\label{eqn_pre_objective}
-\min \sum_{Q_i\in \mathcal{Q}}{\left( C_{meta}\left( Q_i \right) +\sum_{R\in \mathcal{R}_{Q_i}}{\left( C_{geo}\left( R,Q_i \right) +C_{io}\left( R,Q_i \right) \right)} \right)},
+\min \sum_{Q_i\in \mathcal{Q}} \bigl( C_{\text{meta}}(Q_i)+\sum_{R\in \mathcal{R}_{Q_i}} \bigl( C_{\text{geo}}(R,Q_i) + C_{\text{io}}(R,Q_i) \bigr) \bigr),
\end{equation}
subject to:
\begin{enumerate}
@@ -153,6 +154,18 @@ subject to:
\item \textit{Isolation:} Concurrent reads must effectively share I/O bandwidth without causing starvation or excessive thrashing.
\end{enumerate}
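For concreteness, the objective above can be sketched as a plain cost evaluator. The per-term cost callables (`c_meta`, `c_geo`, `c_io`) and the candidate-region mapping `regions_of` are hypothetical stand-ins for the cost models defined in the paper, not part of the original text:

```python
def total_cost(queries, regions_of, c_meta, c_geo, c_io):
    """Evaluate the objective of Eq. (eqn_pre_objective): for every
    retrieval Q_i, pay one metadata-lookup cost plus a geometry cost
    and an I/O cost for each candidate region R in R_{Q_i}."""
    return sum(
        c_meta(q) + sum(c_geo(r, q) + c_io(r, q) for r in regions_of(q))
        for q in queries
    )
```

The constraints (consistency, isolation) are enforced by the scheduler, not by this evaluator; the sketch only mirrors the summation structure of the objective term by term.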
\section{System Overview}\label{sec:Overview}
\begin{figure}
\centering
\includegraphics[width=2.2in]{fig/overview.png}
\caption{The workflow for processing concurrent spatio-temporal range retrievals in the system}
\label{fig:overview}
\end{figure}
To address the challenges of storage-level I/O contention and expensive runtime computations, we propose a layered distributed retrieval framework. As illustrated in Fig.~\ref{fig:overview}, the system architecture is composed of five primary processing components: (1) \emph{request interface}, (2) \emph{index manager}, (3) \emph{I/O coordinator}, (4) \emph{parallel executors}, and (5) \emph{adaptive tuner}.
The \emph{request interface} serves as the system entry point and is responsible for accepting concurrent spatio-temporal retrievals. The \emph{index manager} acts as the planner of the system, interacting with the metadata storage: it translates logical spatio-temporal predicates into physical storage locations using a dual-layer inverted index. The \emph{I/O coordinator} serves as the traffic-control layer: it detects spatial overlaps among concurrent reading plans to identify potential I/O conflicts and applies the hybrid concurrency-aware protocol to reorder or merge conflicting requests. The \emph{parallel executors} interface with the distributed file system or object store to read the pixel data. Finally, the \emph{adaptive tuner} optimizes the execution parameters in the background.
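The control flow through these components can be sketched as follows. All class and function names here are hypothetical illustrations of the described roles (the paper's actual interfaces are not specified at this point); the stubs only reproduce the logical-to-physical translation and the merge-on-overlap behaviour:

```python
class IndexManager:
    """Hypothetical stand-in for the dual-layer inverted index lookup."""
    def lookup(self, query):
        # Translate the logical window into (tile, read-window) pairs.
        return [(f"tile-{i}", query) for i in range(2)]

class IOCoordinator:
    """Hypothetical stand-in: merge concurrent reads hitting the same tile."""
    def admit(self, plan):
        merged = {}
        for tile, window in plan:
            merged.setdefault(tile, []).append(window)
        return merged

def process_retrieval(query, index, coordinator, read_fn):
    plan = index.lookup(query)            # index manager: logical -> physical
    schedule = coordinator.admit(plan)    # I/O coordinator: dedupe/reorder
    return {t: read_fn(t) for t in schedule}  # parallel executors: fetch data
```

The adaptive tuner is omitted from the sketch since it runs in the background and does not sit on the per-request path.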
\section{I/O-aware Indexing Structure}\label{sec:Index}
This section introduces the details of the indexing structure for spatio-temporal range retrieval over RS data.
@@ -452,6 +465,8 @@ To evaluate the system performance under diverse scenarios, we developed a synth
\item Concurrency \& Contention: The number of concurrent clients $N$ varies from 1 to 64. To test the coordination mechanism, we control the Spatial Overlap Ratio $\sigma \in [0, 0.9]$ to simulate workloads ranging from disjoint access to highly concentrated hotspots.
\end{itemize}
It is worth noting that, given the data-intensive nature of retrievals where a single request triggers GB-scale I/O and complex decoding, 64 concurrent streams are sufficient to fully saturate the aggregate I/O bandwidth and CPU resources of our experimental cluster. With 8 worker nodes connected via 10GbE, a concurrency of 64 implies an average of 8 heavy I/O threads per node. Previous characterization studies on Lustre-based supercomputers \cite{Xie12supercomputer} have revealed that client-side flow control typically limits in-flight RPCs to 8 concurrent requests and that exceeding this parallelism level exacerbates resource contention and straggler effects. Therefore, this setting represents a realistic heavy-load scenario where I/O interference significantly impacts performance.
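One way the workload generator could realize the Spatial Overlap Ratio is sketched below; this is an assumed implementation (fixed hotspot centre, square windows, probabilistic hotspot assignment), not the generator actually used in the experiments:

```python
import random

def gen_workload(n_clients, sigma, extent=10_000, size=500, seed=42):
    """Generate n_clients square query windows. With probability sigma a
    window is centred near a shared hotspot (concentrated access);
    otherwise it is placed uniformly over the extent (disjoint access),
    so sigma steers the degree of footprint overlap among clients."""
    rng = random.Random(seed)
    hx = hy = extent / 2              # hypothetical fixed hotspot centre
    windows = []
    for _ in range(n_clients):
        if rng.random() < sigma:      # hotspot access -> overlapping reads
            x = hx + rng.uniform(-size / 4, size / 4)
            y = hy + rng.uniform(-size / 4, size / 4)
        else:                         # disjoint access pattern
            x = rng.uniform(0, extent - size)
            y = rng.uniform(0, extent - size)
        windows.append((x, y, x + size, y + size))
    return windows
```

With `sigma=0` every window is placed uniformly (disjoint regime); with `sigma=0.9` nearly all windows cluster on the hotspot, approximating the highly concentrated case tested above.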
\subsubsection{Experimental Environment}
\label{sec_exp_env}
All experiments are conducted on a cluster with 9 homogeneous nodes (1 master node and 8 worker nodes). The cluster is connected via a 10Gbps high-speed Ethernet to ensure that network bandwidth is not the primary bottleneck compared to storage I/O. Table \ref{table_config} lists the detailed hardware and software configurations. The I/O-aware index (G2I/I2G) is deployed on HBase, while the raw image data is served by the Lustre parallel file system.
@@ -531,16 +546,23 @@ First, we evaluated the effectiveness of data reduction by measuring the I/O sel
\end{minipage}
}
\label{fig:index_exp2_1}
-\subfigure[Various baselines]{
+\subfigure[Query footprint ratios]{
\begin{minipage}[b]{0.227\textwidth}
\includegraphics[width=0.95\textwidth]{exp/index_exp2_2.pdf}
\end{minipage}
}
\label{fig:index_exp2_2}
-\caption{End-to-End retrieval latency and latency breakdown}
+\caption{End-to-End retrieval latency}
\label{fig:index_exp2}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/index_exp2_3.pdf}
\caption{Latency breakdown}
\label{fig:index_exp2_3}
\end{figure}
We next measured the end-to-end retrieval latency to verify whether the I/O reduction translates into time efficiency. Fig.~\ref{fig:index_exp2}(a) reports the mean and 95th percentile (P95) latency across varying retrieval footprint ratios. The results reveal three distinct performance behaviors: Baseline 1 shows a high and flat latency curve ($\approx 4500$ ms), dominated by the cost of transferring entire images. Baseline 2, despite its optimal I/O selectivity, exhibits a significant latency floor ($\approx 380$ ms for small tile-level retrievals). This overhead stems from the on-the-fly geospatial computations required to calculate precise read windows. Ours achieves the lowest latency, ranging from 34 ms to 59 ms for typical tile-level retrievals. Crucially, for small-to-medium retrievals, our method outperforms Baseline 2 by an order of magnitude. The gap between the two curves highlights the advantage of our deterministic indexing approach: by pre-materializing grid-to-window mappings, we eliminate runtime coordinate transformations. Although our I/O volume is slightly larger (as shown in Sec.~\ref{sec:Index_exp_1}), the time saved by avoiding computational overhead far outweighs the cost of transferring a few extra kilobytes of padding data.
To empirically validate the cost model proposed in Eq.~\ref{eqn:cost_total}, we further decomposed the retrieval latency into three components: metadata lookup ($C_{meta}$), geospatial computation ($C_{geo}$), and I/O access ($C_{io}$). Fig.~\ref{fig:index_exp2_3} presents the time consumption breakdown for a representative medium-scale retrieval (involving approx. 50 image tiles). As expected, the latency of Baseline 1 is entirely dominated by $C_{io}$, rendering $C_{meta}$ and $C_{geo}$ negligible: the massive data transfer masks all other overheads. While $C_{io}$ of Baseline 2 is successfully reduced to the window size, a new bottleneck emerges in $C_{geo}$. The runtime coordinate transformations and polygon clipping consume nearly $40\%$ of the total execution time ($\approx 350$~ms). This observation confirms our theoretical analysis that window-based I/O shifts the bottleneck from storage to CPU. The proposed method exhibits a balanced profile. Although $C_{meta}$ increases slightly ($\approx 35$~ms) due to the two-phase index lookup (G2I + I2G), this cost is well-amortized. Crucially, $C_{geo}$ is effectively eliminated thanks to the pre-computed grid-window mappings. Consequently, our approach achieves a total latency of 580 ms, providing a $1.7\times$ speedup over Baseline 2 by removing the computational bottleneck without regressing on I/O performance.
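As a quick internal-consistency check on the figures quoted above (numbers taken from the text, not re-measured here), the reported speedup and total latency jointly imply the Baseline 2 total and the share of time spent in $C_{geo}$:

```python
# Sanity check of the reported breakdown: a 1.7x speedup over Baseline 2
# at a 580 ms total latency implies a Baseline 2 total of roughly 986 ms,
# of which the ~350 ms spent in C_geo is about 35%, i.e. "nearly 40%".
ours_total_ms = 580
speedup = 1.7
baseline2_total_ms = speedup * ours_total_ms   # implied Baseline 2 total
geo_fraction = 350 / baseline2_total_ms        # C_geo share of Baseline 2
```

The three quoted quantities (speedup, total latency, $C_{geo}$ time) are therefore mutually consistent.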
@@ -601,7 +623,7 @@ Finally, we evaluated the scalability and cost of maintaining the index. Fig.~\r
\subsection{Evaluating the Concurrency Control}
In this section, we evaluate the proposed hybrid coordination mechanism on a distributed storage cluster to assess its scalability, robustness under contention, and internal storage efficiency.
-To systematically control the workload characteristics, we developed a synthetic workload generator. We define the Spatial Overlap Ratio ($\sigma$) to quantify the extent of shared data regions among concurrent queries, ranging from $\sigma=0$ (disjoint) to $\sigma=0.9$ (highly concentrated hotspots). The number of concurrent clients varies from $N=1$ to $N=64$. It is worth noting that, given the data-intensive nature of retrievals where a single request triggers GB-scale I/O and complex decoding, 64 concurrent streams are sufficient to fully saturate the aggregate I/O bandwidth and CPU resources of our experimental cluster, representing a heavy-load scenario in operational scientific computing environments.
+To systematically control the workload characteristics, we developed a synthetic workload generator. We define the Spatial Overlap Ratio ($\sigma$) to quantify the extent of shared data regions among concurrent queries, ranging from $\sigma=0$ (disjoint) to $\sigma=0.9$ (highly concentrated hotspots). The number of concurrent clients varies from $N=1$ to $N=64$.
For comparison, we evaluate the following execution schemes:
\begin{enumerate}