Infoscience

master thesis

Duplicate detection against data leakage in a company setting

Duchesne, Nathan
September 9, 2024

DupeLeak8 is a standalone, easy-to-deploy tool for detecting potential data leaks caused by near-duplicate files in a large file system. It performs efficient all-to-all file comparisons, which gives global coverage of the system’s textual duplication issues, and compares the classification levels of the detected duplicates to check for inconsistencies, on the assumption that files with similar content are likely to have similar classification levels. It also provides a one-to-all comparison method that can be used to monitor duplicated content in environments with constant file turnover, such as a company file system. DupeLeak8 offers flexible, customizable settings to adapt to the user’s needs. It focuses on detecting data breaches due to exact or partial textual duplicates at a granularity level chosen by the user, by leveraging the advantages of both local and global similarity analysis techniques. The tool is implemented in Java and relies on widely audited and tested open-source libraries. Its API lets users run the application in as few as three lines of code. We evaluate DupeLeak8 on a dataset of 6k internal enterprise documents totaling 22 GB. All-to-all comparisons took 13 minutes, and one-to-all queries can be performed interactively in less than 1 second.
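To make the idea concrete, here is a minimal Java sketch of near-duplicate detection combined with a classification-level consistency check. It is not DupeLeak8's code or API: the abstract does not specify the similarity technique, so this sketch assumes word shingles compared with Jaccard similarity and a brute-force pairwise loop rather than the efficient method the thesis evaluates. The names `NearDuplicateSketch`, `Doc`, `shingles`, `jaccard`, `allToAll`, and `oneToAll`, as well as the shingle size and threshold parameters, are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NearDuplicateSketch {

    /** Hypothetical record: file name, classification level, extracted text. */
    static final class Doc {
        final String name;
        final String level;
        final String text;

        Doc(String name, String level, String text) {
            this.name = name;
            this.level = level;
            this.text = text;
        }
    }

    /** Returns the set of lowercase word k-shingles (overlapping runs of k words). */
    static Set<String> shingles(String text, int k) {
        Set<String> result = new HashSet<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.toLowerCase().split("\\s+");
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0 when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /**
     * Brute-force all-to-all comparison (assumption: O(n^2) pairs, for illustration only).
     * Flags pairs that are similar in content but differ in classification level.
     */
    static List<String> allToAll(List<Doc> docs, int k, double threshold) {
        List<Set<String>> sets = new ArrayList<>();
        for (Doc d : docs) {
            sets.add(shingles(d.text, k));
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                boolean similar = jaccard(sets.get(i), sets.get(j)) >= threshold;
                boolean levelsDiffer = !docs.get(i).level.equals(docs.get(j).level);
                if (similar && levelsDiffer) {
                    flagged.add(docs.get(i).name + " <-> " + docs.get(j).name);
                }
            }
        }
        return flagged;
    }

    /** One-to-all query: names of stored documents similar to the given text. */
    static List<String> oneToAll(String queryText, List<Doc> docs, int k, double threshold) {
        Set<String> query = shingles(queryText, k);
        List<String> matches = new ArrayList<>();
        for (Doc d : docs) {
            if (jaccard(query, shingles(d.text, k)) >= threshold) {
                matches.add(d.name);
            }
        }
        return matches;
    }
}
```

In this sketch, calling `allToAll` on a list of `Doc` objects returns the pairs whose content is similar but whose classification levels disagree, while `oneToAll` checks a single new text against the existing collection. A pairwise loop like this would not scale to the dataset sizes reported above; it only illustrates the comparison and consistency logic.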

Name: Nathan-Duchesne.pdf
Type: Main Document
Version: http://purl.org/coar/version/c_970fb48d4fbd8a85
Access type: openaccess
License Condition: CC BY
Size: 2.31 MB
Format: Adobe PDF
Checksum (MD5): 611a19c072a911f76bcdbce28d00413a


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved