Infoscience

research article

Single-pass Detection of Jailbreaking Input in Large Language Models

Candogan, Leyla • Wu, Yongtao • Abad Rocamora, Elias • Chrysos, Grigorios • Cevher, Volkan
February 2025
Transactions on Machine Learning Research

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem: existing approaches require multiple requests or even queries to auxiliary LLMs, which makes them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
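
The idea of classifying an input from the logits of one forward pass can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation (the repository linked in this record holds that). Everything specific here is an assumption made for illustration: the feature choice (negative log-probabilities of the top-k tokens at the first few output positions), the plain logistic-regression classifier, the helper names, and above all the synthetic logits, which simply pretend that attacked prompts yield flatter next-token distributions than benign ones.

```python
import math
import random


def top_k_neg_log_probs(logits, k):
    """Return -log(softmax) for the k most likely tokens at one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    best = sorted(logits, reverse=True)[:k]
    return [log_z - x for x in best]


def featurize(logit_rows, k):
    """Concatenate the top-k features of the first few output positions."""
    feats = []
    for row in logit_rows:
        feats.extend(top_k_neg_log_probs(row, k))
    return feats


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear decision score; positive means 'flag as attack'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic(X, y, lr=0.01, epochs=100):
    """Plain SGD logistic regression (a stand-in classifier, by assumption)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = sigmoid(score(w, b, xi)) - yi
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def synthetic_logits(rng, attacked, vocab=20, positions=3):
    """TOY DATA (assumption): benign prompts get one dominant logit per
    position, attacked prompts get a much flatter distribution."""
    peak = 2.0 if attacked else 10.0
    rows = []
    for _ in range(positions):
        row = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
        row[0] = peak + rng.gauss(0.0, 0.5)
        rows.append(row)
    return rows


def run_demo(seed=0, n=200, k=2):
    """Train on n toy samples, return accuracy on n held-out toy samples."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(2 * n)]  # 1 = attacked, 0 = benign
    X = [featurize(synthetic_logits(rng, bool(lab)), k) for lab in labels]
    w, b = train_logistic(X[:n], labels[:n])
    hits = sum((score(w, b, x) > 0) == bool(t)
               for x, t in zip(X[n:], labels[n:]))
    return hits / n


if __name__ == "__main__":
    print("held-out accuracy on toy data:", run_demo())
```

In a real setting the rows passed to `featurize` would come from the model's own logits for the prompt under inspection, so the check costs no more than the forward pass that is already being run. The accuracy printed here reflects only the toy data and says nothing about real jailbreak detection.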

Type
research article
ArXiv ID

2502.15435

Author(s)
Candogan, Leyla
Wu, Yongtao  

EPFL

Abad Rocamora, Elias  

EPFL

Chrysos, Grigorios  

EPFL

Cevher, Volkan

EPFL

Date Issued

2025-02

Published in
Transactions on Machine Learning Research
Volume

02/2025

Subjects

ML-AI

URL

Link to OpenReview

https://openreview.net/forum?id=42v6I5Ut9a

Link to the code

https://github.com/LIONS-EPFL/SPD
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LIONS  
Available on Infoscience
August 29, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/253599

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved