Efficient Neural Clustering and Compression of Strings Through Approximate Euclidean Embeddings of the Levenshtein Distance
Efficient information retrieval is a fundamental requirement in a wide array of applications, ranging from bioinformatics to database systems, and critically hinges on the complexity of the underlying distance metric. This work presents a novel approach to computing Levenshtein embeddings in Euclidean spaces by leveraging transformer-based models to obtain string embeddings with low distortion. L2 embeddings of the Levenshtein distance in 64 dimensions, for sequences of length up to 256, are obtained by means of coarse-to-fine convolutions of learned 1-, 2-, and 4-mer embeddings of the original string fed into transformer encoder stacks. These embeddings are shown to facilitate efficient retrieval and clustering, and to improve reference-based compression techniques by accelerating the search for similar sequences. Four types of experiments are reported: (1) the distortion of the embeddings, which characterizes fidelity to the original Levenshtein metric by comparing the original distance with the one estimated between the corresponding L2 embeddings; (2) the performance of top-k similar-string retrieval, which uses Jaccard similarity to verify the precision of the embedding space; (3) how well k-means over string embeddings approximates k-medoid clustering, with a comparative analysis against ground-truth medoids, to evaluate the clustering efficiency of the embeddings; and (4) the quality of reference-based compression, implemented by projecting a dataset of strings into the embedding space, computing a k-means clustering, and compressing the payload formed by concatenating the cluster centers and the alignment traceback of each string to its assigned cluster center. Results of these preliminary experiments indicate that the embeddings (1) present low distortion for the target applications; (2) can be used to retrieve similar strings of length 100 with more than 0.92 Jaccard similarity from datasets of 10, 50, or 100 random clusters with 40 strings per cluster and inter-cluster distances of 5, 10, 25, or 50; (3) produce an "elbow" pattern in the k-means inertia that mirrors that of the k-medoid approximation with increasing k; and (4) achieve adequate reference-based compression through k-means clustering, with performance depending on the underlying distribution of the string dataset. These findings confirm that these string embeddings could be used to build tools for improving information retrieval in several domains.
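To make the distortion experiment (1) concrete, the following is a minimal sketch of how fidelity to the Levenshtein metric can be measured. The `embed` callable is a hypothetical stand-in for the paper's transformer encoder; any model mapping a string to a 64-dimensional vector could be plugged in.

```python
# Sketch of the distortion check: compare true Levenshtein distances
# with L2 distances between learned embeddings.
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distortion(pairs, embed):
    """Mean ratio of embedding-space L2 distance to true edit distance.

    `embed` is a hypothetical string -> 64-dim np.ndarray encoder.
    A ratio concentrated near 1.0 indicates low distortion.
    """
    ratios = []
    for a, b in pairs:
        d_true = levenshtein(a, b)
        d_emb = np.linalg.norm(embed(a) - embed(b))
        if d_true > 0:
            ratios.append(d_emb / d_true)
    return float(np.mean(ratios))
```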
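Experiment (2) can be sketched along the same lines: retrieve the top-k neighbours of a query in the embedding space and score them with Jaccard similarity against the exact top-k under the Levenshtein distance. The helper below assumes an (n, 64) NumPy array of embeddings and can reuse the `levenshtein` function above as `dist`.

```python
# Sketch of top-k retrieval evaluated with Jaccard similarity.
import numpy as np

def topk_l2(embeddings: np.ndarray, q: int, k: int) -> set:
    """Indices of the k nearest strings to query index q under L2."""
    d = np.linalg.norm(embeddings - embeddings[q], axis=1)
    d[q] = np.inf                       # exclude the query itself
    return set(np.argsort(d)[:k])

def topk_true(strings, q: int, k: int, dist) -> set:
    """Ground-truth top-k under the exact edit distance `dist`."""
    d = [dist(strings[q], s) if i != q else float("inf")
         for i, s in enumerate(strings)]
    return set(np.argsort(d)[:k])

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between retrieved and ground-truth index sets."""
    return len(a & b) / len(a | b)
```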
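Finally, a hedged sketch of the reference-based compression pipeline of experiment (4), assuming scikit-learn's KMeans for clustering and difflib opcodes as a stand-in for the alignment traceback of each string against its cluster reference; the paper's actual traceback encoding is not specified here.

```python
# Sketch of reference-based compression: cluster embeddings with k-means,
# pick the string closest to each centroid as the cluster reference, and
# encode every string as an edit script against its reference.
import difflib
import numpy as np
from sklearn.cluster import KMeans

def compress(strings, embeddings, k):
    # n_init="auto" requires scikit-learn >= 1.2
    km = KMeans(n_clusters=k, n_init="auto").fit(embeddings)
    payload = {"references": [], "scripts": []}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Reference = member whose embedding is closest to the centroid.
        d = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        payload["references"].append(strings[members[np.argmin(d)]])
    for s, c in zip(strings, km.labels_):
        ref = payload["references"][c]
        # difflib opcodes serve here as a stand-in for the alignment traceback.
        ops = difflib.SequenceMatcher(a=ref, b=s).get_opcodes()
        payload["scripts"].append(ops)
    return payload
```

The payload of references plus per-string edit scripts is what would then be handed to a general-purpose compressor, so compression quality depends on how tightly the clusters group genuinely similar strings.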
Scopus EID: 2-s2.0-85194839158
Year: 2024
ISBN: 9798350385878
Page: 575
Review status: REVIEWED
Institution: EPFL
Event name | Event acronym | Event place | Event date |
Data Compression Conference | DCC 2024 | Snowbird, United States | 2024-03-19 - 2024-03-22 |