Add tuning baselines

龙澳
2026-02-11 11:09:40 +08:00
parent 1bd2f32b09
commit 73c1aca15b
10 changed files with 186 additions and 109 deletions


@@ -531,7 +531,7 @@ The comparative methods are categorized as follows:
\item \textbf{OpenDataCube (Window-based I/O):} A state-of-the-art data cube system that couples PostGIS indexes with windowed I/O via rasterio, enabling partial reads from monolithic image files. By leveraging GeoBox-based ROI computation and automatic overview selection, OpenDataCube represents the theoretical optimum for I/O selectivity but incurs runtime geospatial computation overhead to resolve pixel-to-geographic mappings.
\item \textbf{Rio-tiler (Window-based I/O):} A lightweight raster reading engine optimized for dynamic tile generation. Similar to OpenDataCube, it employs PostGIS for spatial indexing and windowed I/O for partial data access, but features a streamlined execution path with minimal abstraction layers, resulting in lower per-query overhead. Rio-tiler serves as a high-performance baseline for windowed reading without the complexity of full data cube management.
\item \textbf{Ours (I/O-aware Indexing):} The proposed approach leverages a dual-layer inverted index structure comprising Grid-to-Image (G2I) and Image-to-Grid (I2G) mappings. By pre-materializing grid-to-pixel correspondences at ingestion time, our method translates spatio-temporal predicates directly into byte-level read plans, completely eliminating runtime geometric computations while preserving minimal I/O volume through precise windowed access.
\end{enumerate}
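The byte-level read planning described for our approach can be sketched as follows. This is a deliberately simplified illustration, not the paper's implementation: the grid-cell keys, image identifiers, and byte offsets are invented, and only the G2I direction of the dual-layer index is shown.

```python
# Illustrative sketch of the Grid-to-Image (G2I) half of a dual-layer
# inverted index. Each entry pre-materializes, at ingestion time, the byte
# range of a grid cell's pixel block inside an image file, so query-time
# planning needs no geometric computation. All values are hypothetical.
G2I = {
    # grid_cell_id -> list of (image_id, byte_offset, byte_length)
    (512, 1024): [("S2A_20240101", 4_194_304, 65_536),
                  ("S2B_20240106", 8_388_608, 65_536)],
    (512, 1025): [("S2A_20240101", 4_259_840, 65_536)],
}

def build_read_plan(cells, time_filter):
    """Translate a set of grid cells plus a temporal predicate into
    per-file byte ranges, grouped so each file is opened once."""
    plan = {}
    for cell in cells:
        for image_id, offset, length in G2I.get(cell, []):
            if time_filter(image_id):
                plan.setdefault(image_id, []).append((offset, length))
    return plan

plan = build_read_plan([(512, 1024), (512, 1025)],
                       time_filter=lambda img: img.startswith("S2A"))
```

Because the pixel-to-byte correspondence is resolved at ingestion, the query path reduces to dictionary lookups followed by windowed reads of the planned ranges.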
@@ -727,46 +727,36 @@ Our hybrid approach successfully combines the benefits of both worlds. As shown
\subsection{Evaluating the I/O Tuning}
In this section, we evaluate the effectiveness of the proposed SA-GMAB tuning framework. The experiments are designed to verify four key properties: fast convergence speed, robustness against stochastic noise, adaptability to workload shifts, and tangible end-to-end performance gains.
To comprehensively assess SA-GMAB across different optimization paradigms, we benchmark it against five representative tuning strategies spanning heuristic search, probabilistic modeling, simulation-based prediction, and reinforcement learning:
\begin{enumerate}
\item \textbf{Genetic Algorithm (GA):} A canonical evolutionary search method that explores the configuration space through selection, crossover, and mutation operators \cite{Behzad13HDF5}. GA serves as the foundational algorithm in TunIO and represents the baseline heuristic approach.
\item \textbf{Simulated Annealing (SA):} A classical stochastic optimization technique inspired by metallurgical annealing \cite{Chen98SA, Robert20SA}. SA has been widely applied in HPC I/O tuning for over two decades and provides a mature baseline for convergence analysis.
\item \textbf{Bayesian Optimization with TPE:} A model-based sequential optimization method that constructs a surrogate using Tree-structured Parzen Estimators and selects candidates via Expected Improvement \cite{Agarwal19TPE}. TPE represents state-of-the-art probabilistic optimization and achieves rapid convergence in recent HPC I/O studies.
\item \textbf{Random Forest Regression (RF):} A simulation-based approach that trains an ensemble predictor on historical execution logs to rank candidate configurations offline \cite{Bagbaba20RF}. RF drastically reduces tuning time from hours to seconds by avoiding repeated real-system evaluations.
\item \textbf{TunIO:} A recent framework integrating high-impact parameter selection with Reinforcement Learning-driven early stopping \cite{Rajesh24TunIO}. TunIO balances tuning cost and performance in complex HPC I/O stacks and represents the state-of-the-art RL-based approach.
\item \textbf{SA-GMAB (Ours):} The proposed framework combining surrogate modeling with a Genetic Multi-Armed Bandit strategy, explicitly designed to accelerate convergence and handle stochastic performance fluctuations in concurrent workloads.
\end{enumerate}
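As a minimal sketch of the GA baseline in item 1, the toy search below evolves a synthetic configuration space through selection and mutation. The knobs (`chunk_kb`, `threads`, `readahead`) and the latency model are stand-ins for illustration, not the parameter space or measurements used in the paper.

```python
import random

# Toy genetic-algorithm search over a synthetic I/O configuration space.
# Lower latency() is better; the objective is an invented stand-in.
SPACE = {"chunk_kb": [64, 128, 256, 512],
         "threads": [1, 2, 4, 8],
         "readahead": [0, 1]}

def latency(cfg):
    # Synthetic objective with optimum at chunk_kb=256, threads=4, readahead=0.
    return abs(cfg["chunk_kb"] - 256) / 64 + abs(cfg["threads"] - 4) + cfg["readahead"]

def random_cfg():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    knob = random.choice(list(SPACE))
    child[knob] = random.choice(SPACE[knob])
    return child

def ga_search(pop_size=8, generations=20, seed=0):
    random.seed(seed)
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=latency)
        parents = pop[: pop_size // 2]                    # selection (elitist)
        pop = parents + [mutate(random.choice(parents))   # mutation
                         for _ in range(pop_size - len(parents))]
    return min(pop, key=latency)

best = ga_search()
```

The staircase-like convergence discussed later follows from this structure: because mutation carries no memory of past evaluations, progress stalls until a random perturbation happens to land in a better region.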
\subsubsection{Convergence Speed and Tuning Cost}
\begin{figure}
\centering
\includegraphics[width=1.8in]{exp/tune_exp1_1.pdf}
\caption{Efficiency analysis of the tuning framework.}
\label{fig:tune_exp1}
\end{figure}
We conduct a cold-start tuning experiment to evaluate how efficiently each method identifies high-performance I/O configurations from an unoptimized initial state. All methods start from the same default configuration with an initial latency of 834 ms. Each tuning step corresponds to evaluating one candidate configuration on the actual system, and we record the best-observed latency over 100 steps.
\par
Corresponding to the convergence trajectories in Fig.~\ref{fig:tune_exp1}, the six methods exhibit distinct convergence patterns that can be categorized into three groups. SA exhibits the poorest performance, with latency initially surging to 1,009 ms at step~3 before gradually declining to 536 ms. Its non-monotonic acceptance of worse configurations proves detrimental in expensive I/O tuning scenarios. GA demonstrates steady but slow improvement, following a characteristic staircase-like descent with prolonged plateaus. GA requires over 100 steps to reach 394 ms. The mutation operator repeatedly explores ineffective regions, resulting in low information gain per evaluation. RF achieves rapid initial descent, dropping to approximately 480 ms within the first 10 steps and eventually reaching 336 ms. By constructing a surrogate model from historical execution data, RF can rank candidates without direct system evaluation. However, the plateau observed after step~15 suggests that the surrogate's predictive accuracy becomes a bottleneck—the model cannot extrapolate beyond the training distribution, limiting further improvement. BO-TPE exhibits the best performance among model-based methods, converging to 310 ms by step~100. BO-TPE effectively balances exploration and exploitation by maintaining a probabilistic surrogate and selecting candidates via expected improvement.
To strictly quantify the cost-effectiveness of the tuning process, we adopt the \textit{Return on Tuning Investment} (RoTI) metric proposed in TunIO \cite{Rajesh24TunIO}. We define the application performance $\mathcal{P}$ as the reciprocal of the query latency (i.e., $\mathcal{P} \propto 1/\mathcal{L}$). The RoTI metric is formalized as follows:
\begin{equation}
\label{eq:roti}
RoTI(t) = \frac{\mathcal{P}_{achieved}(t) - \mathcal{P}_{initial}}{t},
\end{equation}
where $t$ denotes the cumulative tuning time (overhead). $\mathcal{P}_{initial} = 1 / \mathcal{L}_{0}$ represents the baseline performance derived from the default configuration, and $\mathcal{P}_{achieved}(t) = 1 / \mathcal{L}_{t}$ represents the maximum performance achieved up to time $t$. Functionally, this metric represents the performance gain purchased per unit of tuning time. A higher RoTI value signifies that the optimizer rapidly identifies low-latency configurations with minimal computational overhead.
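As a concrete illustration of Eq.~(\ref{eq:roti}), the short computation below evaluates RoTI with made-up numbers; the absolute scale depends on the units chosen for $\mathcal{P}$ and $t$, so the values are not comparable to the figures reported in the experiments.

```python
# Worked example of the Return on Tuning Investment (RoTI) metric:
# performance P is the reciprocal of latency, and RoTI(t) is the
# performance gain divided by the cumulative tuning time t.
def roti(initial_latency_ms, best_latency_ms, elapsed_s):
    p_initial = 1.0 / initial_latency_ms   # P_initial = 1 / L_0
    p_achieved = 1.0 / best_latency_ms     # P_achieved(t) = 1 / L_t
    return (p_achieved - p_initial) / elapsed_s

# e.g. default config at 834 ms, best-so-far 300 ms after 60 s of tuning
value = roti(834, 300, 60)
```

Note the built-in decay: once the best-observed latency saturates, the numerator is fixed while $t$ keeps growing, so RoTI necessarily declines for every method in the late stages.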
The RL-based TunIO outperforms the above baselines but still suffers from a slow start. While it eventually reaches a competitive latency ($\approx 266$ ms at step~71), its exploration phase is costly: the RL agent requires a substantial number of interaction samples to learn the complex mapping between I/O parameters and reward signals. Our method achieves the fastest latency drop, rapidly decreasing from the initial latency to a near-optimal zone ($\approx 277$ ms) within a short time. SA-GMAB leverages the surrogate model to pre-screen candidates, and its permanent memory mechanism enables more efficient candidate pruning, making it particularly suitable for online scenarios where tuning overhead must be minimized.
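The pre-screening idea can be illustrated with a deliberately simplified sketch: rank candidates with a cheap predictor fitted on past observations, and pay the real execution cost only for the most promising ones. Here a 1-nearest-neighbour predictor stands in for the actual surrogate model, and all configurations and latencies are hypothetical.

```python
# Simplified illustration of surrogate-based candidate pre-screening.
# history: list of (config_vector, measured_latency_ms) from past evaluations.
def predict(candidate, history):
    """Predict latency as that of the closest already-evaluated config."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(history, key=lambda rec: dist(rec[0], candidate))[1]

def prescreen(candidates, history, budget):
    """Keep only `budget` candidates with the best predicted latency;
    only these survivors are then evaluated on the real system."""
    ranked = sorted(candidates, key=lambda c: predict(c, history))
    return ranked[:budget]

# Hypothetical configs as (chunk_kb, threads) with measured latencies.
history = [((64, 1), 900.0), ((256, 4), 310.0), ((512, 8), 450.0)]
survivors = prescreen([(128, 2), (256, 8), (512, 1)], history, budget=2)
```

Each pruned candidate saves one full execution on the real system, which is why the information gain per tuning step rises: the expensive evaluations are spent only where the surrogate already expects improvement.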
\subsubsection{Adaptation to Workload Shifts}
\begin{figure}