Duplicate detection against data leakage in a company setting
The DupeLeak8 tool provides a standalone, easy-to-deploy solution for detecting potential data
leaks caused by near-duplicates in a large file system. It performs efficient all-to-all file comparisons,
which give global coverage of the system's textual duplication issues, and compares the
classification levels of the detected duplicates to check for inconsistencies: DupeLeak8 assumes that files with
similar content are more likely to have similar classification levels. It also provides users with a
one-to-all comparison method that can be used to monitor duplicated content in environments with
constant file turnover, such as a company file system. DupeLeak8 offers flexible, customizable
settings to adapt to the user's needs.
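The classification consistency check described above can be illustrated with a minimal Java sketch. It is not DupeLeak8's API: the `Level` enum, the `DuplicatePair` record, the `inconsistent` method, the file names, and the 0.8 threshold are all hypothetical and chosen only for illustration. The sketch assumes that near-duplicate pairs and their similarity scores have already been computed, and it flags the pairs whose classification levels differ.

```java
import java.util.ArrayList;
import java.util.List;

public class ClassificationCheck {
    // Hypothetical classification levels (not taken from DupeLeak8).
    enum Level { PUBLIC, INTERNAL, CONFIDENTIAL, SECRET }

    // Hypothetical container for a pair of files already reported as near-duplicates.
    record DuplicatePair(String fileA, Level levelA, String fileB, Level levelB, double similarity) {}

    // Returns the pairs that are similar enough (>= threshold) but carry different levels.
    static List<DuplicatePair> inconsistent(List<DuplicatePair> pairs, double threshold) {
        List<DuplicatePair> flagged = new ArrayList<>();
        for (DuplicatePair p : pairs) {
            if (p.similarity() >= threshold && p.levelA() != p.levelB()) {
                flagged.add(p);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<DuplicatePair> pairs = List.of(
            new DuplicatePair("report_v1.txt", Level.CONFIDENTIAL, "report_copy.txt", Level.PUBLIC, 0.93),
            new DuplicatePair("memo_a.txt", Level.INTERNAL, "memo_b.txt", Level.INTERNAL, 0.88));
        for (DuplicatePair p : inconsistent(pairs, 0.8)) {
            System.out.println(p.fileA() + " (" + p.levelA() + ") vs "
                + p.fileB() + " (" + p.levelB() + ")");
        }
    }
}
```

In this sketch only the first pair is flagged, because its two files are highly similar yet carry different levels. A real deployment would also need a policy for which of the two levels should be treated as correct.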
DupeLeak8 focuses on detecting data breaches caused by exact or partial textual duplicates at a
granularity level chosen by the user, leveraging the advantages of both local and global similarity
analysis techniques.
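As a rough illustration of how partial textual duplication can be scored at a chosen granularity, the following Java sketch computes the Jaccard similarity of word shingles, a common near-duplicate measure. This is an assumption made for illustration: the abstract does not state which similarity techniques DupeLeak8 uses, and the class name, method names, and the shingle size `k` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {
    // Splits text into lowercase words and returns the set of k-word shingles.
    // A larger k gives a coarser granularity; texts shorter than k words yield an empty set.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 1.0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Illustrative texts only.
        Set<String> first = shingles("the quarterly budget is strictly confidential", 2);
        Set<String> second = shingles("the quarterly budget is now public", 2);
        System.out.println("Jaccard similarity: " + jaccard(first, second));
    }
}
```

Comparing every pair of files this way is quadratic in the number of files, so a system working at the scale described here would presumably rely on some form of indexing or fingerprinting rather than this direct comparison.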
The tool is implemented in Java and relies on widely audited and tested open-source libraries.
DupeLeak8 offers a user-friendly API that lets users run the application with as few as three
lines of code.
We evaluate DupeLeak8 on a large dataset of 6k internal enterprise documents with a total size of
22 GB. All-to-all comparisons took 13 minutes, and one-to-all queries can be performed interactively
in less than 1 second.
Full text: Nathan-Duchesne.pdf (main document, open access, CC BY)