Current adversarial patch generation methods for Vision-Language-Action (VLA) models predominantly optimize perturbations over static, isolated 2D images. This static assumption fundamentally ignores the dynamic realities of embodied, closed-loop robotic systems. During real-world execution, an adversarial patch is continuously subjected to macroscopic kinematic shifts (e.g., motion distortion, scaling, and perspective changes as the camera moves) and microscopic hardware degradations (e.g., CMOS sensor noise, ambient lighting jitter, and printer color gamut clipping). By ignoring these physical factors, traditional static 2D optimizations converge to brittle solutions that suffer from Adversarial Flickering, a phenomenon where a patch succeeds in a pristine digital frame but fails the moment the robot arm moves.
To bridge this critical sim-to-real gap, we propose a novel framework that systematically embeds physical-world dynamics into the 2D optimization pipeline (as shown in Figure 2). Our method mathematically decouples physical robustness into three tractable objectives: simulating macroscopic motion dynamics (Section 2.1), sculpting flat minima to absorb microscopic hardware noise (Section 2.2), and bounding the optimization with strict physical fabricability constraints to guarantee zero-shot real-world transferability (Section 2.3).
Traditional adversarial attacks generate patches by overfitting to pristine, static viewpoints. The moment the robotic arm initiates a trajectory, the resulting perspective distortion (changes in scale, viewing angle, and spatial warping) instantly breaks the adversarial alignment. To guarantee robustness against these macroscopic kinematic shifts without leaving the efficient 2D domain, we employ Expectation over Transformation (EoT) across offline workspace datasets (see Figure 2 (a)).
Formally, the physically augmented observation fed into the VLA is defined as:
\begin{equation} o(x, \delta, \tau) = (1 - M_\tau) \odot x + M_\tau \odot \mathcal{T}(\delta, \tau), \end{equation}where $x \in \mathbb{R}^{H \times W \times 3}$ is a clean, benign camera frame sampled from an offline dataset (e.g., LIBERO), and $\delta$ represents the learnable parameters of the 2D adversarial patch. To simulate the continuous perspective distortions and motion artifacts caused by the robot's movement, we sample a spatial transformation $\tau$ from a predefined physical distribution $\Gamma$ (encompassing random affine scaling, perspective warping, rotation, and brightness shifts). The function $\mathcal{T}(\delta, \tau)$ applies this macroscopic transformation to the digital patch $\delta$, while the operator $\odot$ denotes the Hadamard (element-wise) product.
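The mask-based composition above can be sketched directly in NumPy. This is a minimal illustration, not the paper's implementation: `tau_transform` and `mask_transform` are hypothetical callables standing in for the sampled transformation $\mathcal{T}(\cdot, \tau)$ and its matching mask $M_\tau$.

```python
import numpy as np

def composite_observation(x, delta, tau_transform, mask_transform):
    """Composite a transformed patch into a clean frame: o(x, d, tau).

    x: clean frame, shape (H, W, 3), values in [0, 1]
    delta: patch parameters, shape (h, w, 3)
    tau_transform: maps the patch into frame coordinates -> (H, W, 3)
    mask_transform: returns the matching binary mask M_tau -> (H, W, 1)
    """
    patch_in_frame = tau_transform(delta)      # T(delta, tau)
    m = mask_transform()                       # M_tau in {0, 1}
    return (1.0 - m) * x + m * patch_in_frame  # Hadamard composition
```

In practice, both callables would be driven by the same sampled $\tau$ so the patch pixels and the mask warp consistently.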
Crucially, the dynamically transformed binary mask $M_\tau \in \{0,1\}^{H \times W}$ is not merely a spatial bounding box; it is the mathematical guarantor of physical realizability. To distinguish a legitimate localized physical attack from an impossible global digital perturbation, $M_\tau$ enforces two strict physical constraints: (i) spatial locality, i.e., every perturbed pixel lies strictly inside the patch footprint while the rest of the scene remains untouched; and (ii) kinematic consistency, i.e., the mask undergoes the same transformation $\tau$ as the patch, so the patch boundary warps coherently with the simulated camera motion.
Given this strictly bounded, physically augmented observation, we formulate the base targeted manipulation loss:
\begin{equation} \mathcal{L}_{base}(\delta, x, \tau) = \mathcal{L}_{CE}\Big( \mathcal{F}_{VLA}\big(o(x, \delta, \tau), I\big), \; A_{target} \Big), \end{equation}where $\mathcal{F}_{VLA}(\cdot)$ acts as a frozen white-box Vision-Language-Action model; its internal weights are fixed, but full access to its computational graph is maintained to backpropagate gradients. $\mathcal{L}_{CE}$ denotes the Cross-Entropy loss, which quantifies the discrepancy between the VLA's predicted action distribution (conditioned on the natural language instruction $I$) and the attacker's malicious 7-DoF target $A_{target}$.
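Assuming the VLA discretizes each action dimension into token bins and emits per-dimension logits (as autoregressive VLAs such as OpenVLA do), the targeted cross-entropy term can be sketched as follows; the shapes and the name `targeted_ce_loss` are illustrative, not from the paper's codebase:

```python
import numpy as np

def targeted_ce_loss(logits, target_tokens):
    """Cross-entropy toward the attacker's target action tokens.

    logits: (T, V) -- one row of action-token logits per action dimension
    target_tokens: (T,) int indices encoding the malicious 7-DoF target
    """
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target tokens, averaged over dimensions.
    return -log_probs[np.arange(len(target_tokens)), target_tokens].mean()
```

Minimizing this loss over $\delta$ pushes the VLA's predicted action distribution toward $A_{target}$ rather than merely away from the benign action.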
While EoT addresses macroscopic viewpoint changes, standard empirical risk minimization over static datasets inevitably drives the patch parameters into a "sharp minimum." In the physical world, cameras do not capture perfect pixel matrices. Unmodeled microscopic variations (e.g., CMOS camera sensor grain, ambient lighting fluctuations, and slight printer ink bleeding) act as random noise vectors that instantly eject the patch from this brittle minimum, neutralizing the attack.
To actively sculpt a globally robust "vulnerability basin" (i.e., a flat minimum) that immunizes the patch against these unpredictable sensor and environmental noises (as shown in Figure 2 (b)), we introduce Physical Sharpness-Aware Minimization (P-SAM). We enforce a constraint that the patch must remain aggressively malicious even when subjected to the absolute worst-case microscopic pixel degradation. This is formulated as a minimax objective:
\begin{equation} \min_{\delta} \mathbb{E}_{x, \tau} \left[ \max_{\|\epsilon\|_2 \le \rho} \mathcal{L}_{base}(\delta + \epsilon, x, \tau) \right], \end{equation}where $\epsilon$ is a microscopic adversarial perturbation applied strictly to the patch pixels, acting as a mathematical surrogate for unmodeled physical hardware noise. The hyperparameter $\rho$ defines the $\ell_2$-norm radius of the perturbation ball, effectively controlling the desired width of the vulnerability basin.
Solving the inner maximization exactly is computationally intractable for deep neural networks. However, utilizing the white-box gradients of the VLA, we can efficiently approximate the conditional worst-case noise $\epsilon^*$ via a first-order Taylor expansion, requiring only a single additional backward pass per iteration:
\begin{equation} \epsilon^*(x, \tau) = \rho \frac{\nabla_\delta \mathcal{L}_{base}(\delta, x, \tau)}{\|\nabla_\delta \mathcal{L}_{base}(\delta, x, \tau)\|_2}, \end{equation}where $\nabla_\delta \mathcal{L}_{base}$ denotes the standard gradient of the base loss with respect to the patch parameters. Crucially, this worst-case pixel noise $\epsilon^*(x, \tau)$ is strictly conditioned on the specific background image $x$ and the sampled macroscopic perspective $\tau$.
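Given the gradient $\nabla_\delta \mathcal{L}_{base}$ from a backward pass, the closed-form ascent step above reduces to rescaling that gradient onto the surface of the $\rho$-ball. A minimal sketch (the zero-gradient guard is an implementation detail we assume, not stated in the equation):

```python
import numpy as np

def worst_case_noise(grad, rho):
    """First-order approximation of the inner max: scale the gradient
    of L_base w.r.t. the patch onto the rho-radius l2 ball surface."""
    norm = np.linalg.norm(grad)
    if norm < 1e-12:                 # degenerate flat point: no ascent direction
        return np.zeros_like(grad)
    return rho * grad / norm
```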
Instead of updating the patch using the gradient at its current state $\delta_k$, P-SAM computes the gradient at this worst-case perturbed state. The patch parameters are thus updated as:
\begin{equation} \delta_{k+1} = \delta_k - \alpha \cdot \mathbb{E}_{(x, \tau) \sim \mathcal{B}} \left[ \nabla_\delta \mathcal{L}_{base}\big(\delta_k + \epsilon^*(x, \tau), x, \tau\big) \right], \end{equation}where $\mathcal{B}$ is the current mini-batch of sampled image-transformation pairs and $\alpha$ is the optimization learning rate.
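One P-SAM update combines the two equations above: a first backward pass per sample to obtain $\epsilon^*(x, \tau)$, a second pass at the perturbed patch, and an averaged descent step. The sketch below assumes a hypothetical callable `grad_fn(delta, x, tau)` standing in for the white-box backward pass through the frozen VLA:

```python
import numpy as np

def psam_step(delta, batch, grad_fn, rho, alpha):
    """One P-SAM parameter update over a mini-batch of (x, tau) pairs."""
    grads = []
    for x, tau in batch:
        g = grad_fn(delta, x, tau)                    # 1st backward pass
        eps = rho * g / (np.linalg.norm(g) + 1e-12)   # conditional eps*(x, tau)
        grads.append(grad_fn(delta + eps, x, tau))    # 2nd pass at worst case
    return delta - alpha * np.mean(grads, axis=0)     # averaged SAM gradient
```

Note that $\epsilon^*$ is recomputed per sample, so each element of the batch contributes its own viewpoint-specific worst-case gradient.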
Geometric Interpretation of Global Flatness: Because the local loss landscape shifts continuously with respect to the macroscopic viewpoint $\tau$, the worst-case perturbation $\epsilon^*$ is highly dynamic. By computing the conditional worst-case noise for each $(x, \tau)$ pair independently in the mini-batch and averaging their respective perturbed gradients, the patch $\delta$ is pulled along a trajectory that simultaneously minimizes the loss against an ensemble of viewpoint-specific worst-case degradations. This dynamic acts as a mathematical repellent from sharp cliffs, naturally sculpting a globally flat, noise-resilient vulnerability basin (see Figure 5).
Finally, a mathematically optimal digital patch is fundamentally useless if it cannot be accurately reproduced in the physical world. Static 2D optimization methods inherently assume perfect digital-to-physical pixel fidelity, frequently generating neon, high-frequency pixel static. Such patterns not only exceed the color gamut of physical printers (resulting in severe color clipping) but are also instantly destroyed by the motion blur inherent to a moving robotic camera. To bridge this final deployment gap, we enforce two strict physical regularizers (as shown in Figure 2 (c)).
First, to penalize high-frequency color transitions and ensure the adversarial signal survives motion blur and camera down-sampling, we apply Total Variation (TV) regularization:
\begin{equation} \mathcal{L}_{TV}(\delta) = \sum_{i,j} \sqrt{(\delta_{i+1,j} - \delta_{i,j})^2 + (\delta_{i,j+1} - \delta_{i,j})^2}, \end{equation}which encourages smooth, organic color transitions across the spatial coordinates $(i,j)$ of the patch.
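A direct NumPy rendering of the TV term follows; the small constant inside the square root is a common implementation detail we assume here to keep the gradient finite at zero difference, and is not part of the equation above:

```python
import numpy as np

def tv_loss(delta, eps=1e-8):
    """Isotropic total variation of a (h, w, C) patch, summed over channels."""
    dx = delta[1:, :-1] - delta[:-1, :-1]   # vertical neighbor differences
    dy = delta[:-1, 1:] - delta[:-1, :-1]   # horizontal neighbor differences
    return np.sqrt(dx ** 2 + dy ** 2 + eps).sum()
```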
Second, to ensure the digital pixels can be faithfully reproduced by physical ink without distortion, we apply the Non-Printability Score (NPS):
\begin{equation} \mathcal{L}_{NPS}(\delta) = \sum_{p \in \delta} \min_{c \in C} \|p - c\|_2, \end{equation}which penalizes the distance from every digital pixel vector $p$ to the closest matching color $c$ within a predefined, physically reproducible CMYK printer color gamut $C$. This prevents unpredictable out-of-gamut color clipping during fabrication.
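The NPS term is a nearest-neighbor distance against the printable palette $C$. A vectorized sketch, assuming the palette is given as an array of measured printable RGB colors:

```python
import numpy as np

def nps_loss(delta, palette):
    """Sum over patch pixels of the distance to the nearest printable color.

    delta: (h, w, 3) patch in RGB; palette: (K, 3) printable colors C.
    """
    pixels = delta.reshape(-1, 3)                      # flatten to (N, 3)
    # Pairwise distances (N, K) via broadcasting, then nearest color per pixel.
    dists = np.linalg.norm(pixels[:, None] - palette[None], axis=2)
    return dists.min(axis=1).sum()
```

For large patches, the $(N \times K)$ distance matrix can be computed in chunks to bound memory.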
Final Unified Objective. Integrating the macroscopic kinematic simulation (EoT), the microscopic flat-minima optimization (P-SAM), and the physical fabricability constraints, the complete, end-to-end objective function optimized by the attacker is:
\begin{equation} \mathcal{L}_{final}(\delta) = \mathbb{E}_{(x, \tau) \sim \mathcal{B}} \Big[ \mathcal{L}_{base}\big(\delta + \epsilon^*(x, \tau), x, \tau\big) \Big] + \lambda_1 \mathcal{L}_{TV}(\delta) + \lambda_2 \mathcal{L}_{NPS}(\delta), \end{equation}where $\lambda_1$ and $\lambda_2$ are empirically tuned weighting coefficients that balance the targeted kinematic hijacking success rate against the stealth and physical printability of the adversarial patch. The full optimization procedure is summarized in Algorithm 1. By optimizing Equation 8 entirely offline, the attacker bypasses the need for computationally prohibitive 3D simulations or real-time queries to the target robot. The resulting P-SAM patch sits in a wide "vulnerability basin," inherently robust to both macroscopic kinematic shifts and microscopic physical degradations.