A newly discovered memory corruption vulnerability in vLLM could let attackers crash servers or execute arbitrary code by sending malicious prompt embeddings to the Completions API.
The flaw affects vLLM versions 0.10.2 and later, placing production AI deployments at immediate risk.
“The vulnerability allows any user with access to the API to potentially achieve denial-of-service and remote code execution in the vLLM server process,” said Wiz Security researchers.
Understanding the Root Cause of the vLLM Bug
The vulnerability (CVE-2025-62164) stems from the way vLLM processes user-supplied prompt embeddings, a capability intended to let advanced applications pass precomputed vectors directly to the model.
When a client sends an embedding to the Completions API, vLLM attempts to reconstruct the tensor by deserializing the Base64-encoded payload using PyTorch’s torch.load() function.
The issue occurs in entrypoints/renderer.py, where vLLM decodes the Base64-encoded embedding and deserializes it using torch.load().
After loading the tensor, the server immediately converts it to a dense tensor with to_dense() — and it does so without performing any integrity or safety checks.
In this part of the code, vLLM simply takes the user-supplied embedding, loads it through torch.load(io.BytesIO(pybase64.b64decode(embed, validate=True)), weights_only=True, map_location=torch.device("cpu")), and then executes tensor = tensor.to_dense() with no validation in between.
Because vLLM assumes the tensor is valid and safe at this point, any maliciously crafted payload can pass through the deserialization step and cause memory corruption during densification, enabling denial-of-service or potentially remote code execution.
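To make the pattern concrete, the snippet below is a condensed sketch of that flow as described above — not vLLM's actual source. It substitutes the standard-library base64 module for pybase64 and wraps the quoted calls in a hypothetical helper.

```python
import base64  # stands in for the pybase64 package vLLM uses
import io

import torch


def load_prompt_embedding(embed_b64: str) -> torch.Tensor:
    # Decode the Base64 payload supplied by the API client.
    raw = base64.b64decode(embed_b64, validate=True)
    # weights_only=True blocks arbitrary-object pickles, but the tensor that
    # comes back, including its sparse indices and declared shape, is still
    # fully attacker-controlled.
    tensor = torch.load(
        io.BytesIO(raw),
        weights_only=True,
        map_location=torch.device("cpu"),
    )
    # Densification trusts that metadata blindly; a malformed sparse tensor
    # triggers the out-of-bounds write at this step.
    return tensor.to_dense()
```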
The PyTorch Change That Opened the Door
In PyTorch 2.8.0, the framework disabled sparse tensor integrity checks by default, removing safeguards that previously verified index bounds, tensor shape consistency, and the internal invariants required before calling to_dense().
These checks now have to be explicitly re-enabled using torch.sparse.check_sparse_tensor_invariants, but vLLM does not implement this protection.
As a result, an attacker can create a malformed sparse tensor with internal indices that point outside expected memory ranges.
PyTorch will still load the tensor successfully, but when vLLM later calls to_dense(), the framework attempts to fully materialize the malformed tensor into dense memory, causing an out-of-bounds write and enabling potential memory corruption.
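For illustration only, the toy example below builds a sparse COO tensor whose index falls outside its declared 4x4 shape. It is a sketch of how the invariant-check API behaves, not an exploit reproduction: with the checks left at their disabled default, construction typically succeeds, while re-enabling torch.sparse.check_sparse_tensor_invariants makes PyTorch reject the tensor up front.

```python
import torch

# A 4x4 sparse tensor whose second row index (1000) is far outside the shape.
indices = torch.tensor([[0, 1000], [0, 1]])
values = torch.tensor([1.0, 2.0])

# With invariant checks at their default (disabled) state, construction
# typically succeeds; calling .to_dense() on the result is what risks the
# out-of-bounds write described above, so it is not executed here.
bad = torch.sparse_coo_tensor(indices, values, (4, 4))

# With the checks re-enabled, the malformed tensor is rejected immediately.
try:
    with torch.sparse.check_sparse_tensor_invariants():
        torch.sparse_coo_tensor(indices, values, (4, 4))
except RuntimeError as exc:
    print(f"rejected: {exc}")
```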
What Attackers Can Do With This vLLM Flaw
Depending on how the malicious payload is crafted, the out-of-bounds write can lead to several serious consequences.
It may crash the server and cause a denial-of-service (DoS) condition if critical execution memory is corrupted.
In more advanced cases, an attacker could achieve arbitrary code execution by overwriting memory regions that influence control flow.
This also opens the door to lateral compromise within the AI stack, since vLLM often runs alongside sensitive components such as GPUs, model weights, logs, or proprietary data.
Because the vulnerable deserialization path is exposed through the public-facing Completions API, an attacker does not need elevated privileges or prior access — they only need the ability to send embedding payloads to the server.
Inside the Deserialization Flaw
This exploit chain is a case of unsafe deserialization, where untrusted input is reconstructed directly into in-memory objects. In vLLM’s case, the risk is amplified because:
- Tensor deserialization is complex and memory-intensive.
- The embedding payload passes through no validation layer before use.
- The underlying library (PyTorch) silently allows malformed data.
In short, the system takes attacker-controlled bytes, reconstructs them into a sparse tensor, and then tells PyTorch to expand that tensor into dense memory — all without confirming that the tensor follows required invariants.
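As an illustration of what that missing confirmation could look like, the helper below is a hypothetical guard — not vLLM code, and the function name is invented — that bounds-checks a deserialized COO tensor's indices against its declared shape before densifying it.

```python
import torch


def densify_checked(tensor: torch.Tensor) -> torch.Tensor:
    # Hypothetical guard, not vLLM code: verify sparse metadata before
    # expanding the tensor into dense memory.
    if tensor.layout == torch.sparse_coo:
        indices = tensor._indices()  # raw COO indices (private accessor)
        bounds = torch.tensor(
            tensor.shape[: tensor.sparse_dim()], dtype=indices.dtype
        ).unsqueeze(1)
        if (indices < 0).any() or (indices >= bounds).any():
            raise ValueError("sparse embedding has out-of-range indices")
    return tensor.to_dense()
```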
Key Steps to Mitigate the vLLM Vulnerability
Given the severity of the vLLM deserialization flaw, security teams must adopt a layered mitigation strategy to reduce the risk of server compromise.
- Update to the patched vLLM version and apply PyTorch’s sparse tensor integrity checks to prevent unsafe deserialization (see the sketch after this list).
- Restrict and authenticate access to the Completions API by removing public exposure, enforcing strong authentication, and using rate limiting.
- Validate and filter all prompt embeddings through an API gateway, WAF, or middleware to block malformed or untrusted tensors before they reach vLLM.
- Isolate vLLM in hardened environments such as dedicated containers or VMs, using least privilege, segmentation, and non-privileged service accounts.
- Enable monitoring and logging for exploitation indicators, including crashes, malformed embeddings, deserialization failures, and abnormal inference behavior.
- Strengthen runtime and infrastructure security by applying ASLR, DEP/NX, network segmentation, access controls, and regular security testing such as fuzzing and dependency audits.
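As a minimal sketch of the first item, assuming the check is enabled once at server start-up before any embedding is deserialized:

```python
import torch

# Defense-in-depth: restore the sparse-tensor invariant checks that
# PyTorch 2.8.0 no longer applies by default, so malformed indices are
# rejected during loading or construction instead of corrupting memory
# in to_dense(). Placement at process start-up is an assumption.
torch.sparse.check_sparse_tensor_invariants.enable()
```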
Layered defenses, continuous monitoring, and secure-by-design principles help ensure that future threats are detected earlier and contained more effectively.
AI Infrastructure Is Becoming a Prime Target
This vulnerability underscores a growing theme in AI security: the attack surface doesn’t just include the model — it includes the glue code, inference engines, serialization libraries, and data pipelines surrounding it.
As organizations adopt more LLM-powered capabilities, weaknesses in supporting frameworks such as vLLM, PyTorch, and model-serving APIs become increasingly attractive targets.
The incident also highlights how subtle upstream changes — in this case, PyTorch disabling integrity checks — can create security gaps that ripple across the AI ecosystem.
As AI infrastructure becomes more modular and interconnected, even minor deserialization flaws can escalate into full compromise if organizations fail to apply patches and enforce strict input validation.
