Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Conferences, Workshops, Symposiums, and Seminars
  4. Efficient Neural Clustering and Compression of Strings Through Approximate Euclidean Embeddings of the Levenshtein Distance
 
conference paper

Efficient Neural Clustering and Compression of Strings Through Approximate Euclidean Embeddings of the Levenshtein Distance

Ozturk, Unsal  
•
Erturk, Utku Gorkem
•
Casale-Brunet, Simone
Show more
Bilgin, Ali
•
Fowler, James E.
Show more
2024
Data Compression Conference Proceedings
Data Compression Conference

Efficient information retrieval is a fundamental requirement in a wide array of applications, ranging from bioinformatics to database systems, and critically hinges on the complexity of an underlying distance metric. This work presents a novel approach to compute Levenshtein embeddings within Euclidean spaces by leveraging transformer-based models to obtain string embeddings with low distortion. L 2 embeddings of Levenshtein distance in 64-dimensions for sequences of length up to 256 are obtained by means of coarse-to-fine convolutions of learned 1,2,4-mer embeddings of the original string fed into transformer encoder stacks. These embeddings are shown to facilitate efficient retrieval and clustering processes, as well as to improve reference-based compression techniques by accelerating the search for similar sequences. Different type of experiments report: (1) the distortion of the embeddings, which characterizes the fidelity to the original Levenshtein metric by comparing the original distance with the estimated one between the corresponding L 2 embeddings; (2) the performance of top- k similar string retrieval, which assesses Jaccard similarity to verify the precision of the embedding space; (3) how well k -means over string embeddings approximates k -medoid clustering, with a comparative analysis against ground truth medoids, to evaluate the clustering efficiency of the embeddings; (4) the quality of reference-based compression, implemented by projecting a dataset of strings into the embedding space, computing a k-means clustering, and compressing the payload formed by concatenating cluster centers and the alignment tracebacks of each string to its assigned cluster center. Results of these preliminary experiments indicate that these embeddings (1) present low distortion for the target applications; (2) can be used to retrieve similar strings from a dataset of 10, 50, 100 random clusters with 40 strings per cluster, with inter-cluster distances of 5, 10, 25, 50; of strings of length 100 with more than 0.92 Jaccard similarity, (3) result in an "elbow"pattern in the inertia of the k-means clustering mirroring that of the k-medoid approximation with increasing k, and (4) achieve adequate reference based-compression through k-means clustering, with performance depending on the underlying distribution of the string dataset. These findings confirm that these string embeddings could be used to build tools for improving information retrieval in several domains.

  • Details
  • Metrics
Type
conference paper
DOI
10.1109/DCC58796.2024.00092
Scopus ID

2-s2.0-85194839158

Author(s)
Ozturk, Unsal  

École Polytechnique Fédérale de Lausanne

Erturk, Utku Gorkem

École Polytechnique Fédérale de Lausanne

Casale-Brunet, Simone

Casale Brunet Consulting

Ribeca, Paolo

Biomathematics & Statistics Scotland

Mattavelli, Marco  

École Polytechnique Fédérale de Lausanne

Editors
Bilgin, Ali
•
Fowler, James E.
•
Serra-Sagrista, Joan
•
Ye, Yan
•
Storer, James A.
Date Issued

2024

Publisher

Institute of Electrical and Electronics Engineers Inc.

Published in
Data Compression Conference Proceedings
ISBN of the book

9798350385878

Start page

575

Subjects

euclidean embeddings

•

levenshtein distance

•

string search

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
SCI-STI-MM  
Event nameEvent acronymEvent placeEvent date
Data Compression Conference

Snowbird, United States

2024-03-19 - 2024-03-22

Available on Infoscience
January 26, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/244666
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés