GENERATIVE ADVERSARIAL NETWORKS (GANs) FOR OPTIMIZING DATA DEDUPLICATION IN LARGE DATASETS

JESSY CHRISTADOSS, CHIRANJEEVI DEVI, SRIKANTH GORLE, SWETHA RAVIPUDI, TANUJ MATHUR

DOI Number:

DOI:10.5281/zenodo.16029141

Published: 2025-07-23

About the author(s)

1. JESSY CHRISTADOSS - Senior Quality Engineer, Information Technology, Integral Ad Science, Texas, USA.
2. CHIRANJEEVI DEVI - Engineering Manager, Data Platform, Grammarly, Fremont, USA.
3. SRIKANTH GORLE - Senior Manager, Platform Engineering, CVS Health, Chicago, USA.
4. SWETHA RAVIPUDI - Engineering Manager, DevOps, Lucid Motors, Newark, USA.
5. TANUJ MATHUR - Business Development Director, Information Technology, Independent Scholar, Hempstead, USA.

Abstract

This study explores the application of Generative Adversarial Networks (GANs) to the challenge of data deduplication in large datasets. Traditional deduplication methods, often reliant on hashing and exact matching, suffer from limitations in scalability and effectiveness when encountering near-duplicates. We propose a novel GAN-based framework that recasts deduplication as an adversarial learning problem, where a generator creates synthetic duplicates and a discriminator learns to identify them. By transforming deduplication into a predictive modeling task, the approach achieves improved detection accuracy and a substantial reduction in time complexity from O(n) to approximately O(log n). Empirical evaluations on synthetic large-scale datasets demonstrate significant gains in precision, recall, and runtime efficiency compared to conventional methods. This approach offers a scalable, intelligent alternative for managing redundant data in modern storage systems.
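The contrast between exact matching and learned near-duplicate detection can be illustrated with a minimal sketch. It is not the authors' framework and it is not a full adversarial training loop: the "generator" is assumed to be a fixed random character-edit rule rather than a trained network, and the "discriminator" is assumed to be a one-feature logistic regression over string similarity. All function names and parameters (`exact_hash`, `make_near_duplicate`, `train_discriminator`, and so on) are hypothetical and chosen only for illustration.

```python
import hashlib
import math
import random
import string
from difflib import SequenceMatcher


def exact_hash(record: str) -> str:
    """Baseline fingerprint used by exact-match deduplication."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def make_near_duplicate(record: str, rng: random.Random) -> str:
    """Stand-in 'generator': apply one random character edit.

    Assumption: a fixed edit rule, not a trained generator network.
    """
    i = rng.randrange(len(record))
    op = rng.choice(["substitute", "delete", "insert"])
    if op == "delete":
        return record[:i] + record[i + 1:]
    if op == "insert":
        return record[:i] + rng.choice(string.ascii_lowercase) + record[i:]
    # Substitute with a letter guaranteed to differ from the original.
    options = [c for c in string.ascii_lowercase if c != record[i]]
    return record[:i] + rng.choice(options) + record[i + 1:]


def similarity_feature(a: str, b: str) -> float:
    """Character-level similarity rescaled from [0, 1] to [-1, 1]."""
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def duplicate_probability(a: str, b: str, w: float, bias: float) -> float:
    """Stand-in 'discriminator': logistic model on one similarity feature."""
    return 1.0 / (1.0 + math.exp(-(w * similarity_feature(a, b) + bias)))


def train_discriminator(n_records=200, steps=500, lr=0.5, seed=0):
    """Fit the logistic model on synthetic duplicate and non-duplicate pairs."""
    rng = random.Random(seed)
    records = ["".join(rng.choice(string.ascii_lowercase) for _ in range(20))
               for _ in range(n_records)]
    # Label 1: (record, synthetic near-duplicate). Label 0: two unrelated records.
    pairs = [(r, make_near_duplicate(r, rng), 1.0) for r in records]
    pairs += [(records[i], records[(i + 1) % n_records], 0.0)
              for i in range(n_records)]
    data = [(similarity_feature(a, b), y) for a, b, y in pairs]

    w, bias = 0.0, 0.0
    for _ in range(steps):  # full-batch gradient descent on the logistic loss
        grad_w = grad_b = 0.0
        for z, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * z + bias)))
            grad_w += (p - y) * z
            grad_b += p - y
        w -= lr * grad_w / len(data)
        bias -= lr * grad_b / len(data)
    return w, bias


if __name__ == "__main__":
    a, b = "john smith, 42 main st", "jon smith, 42 main st."
    print("exact hashes match:", exact_hash(a) == exact_hash(b))
    w, bias = train_discriminator()
    print("duplicate probability:", duplicate_probability(a, b, w, bias))
```

In this sketch the two address strings hash to different digests, so an exact-match method treats them as distinct, while the trained classifier scores them by similarity instead. A GAN-based variant would replace the fixed edit rule with a generator that is trained against the discriminator; the sketch makes no claim about the runtime complexity reported in the abstract.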

Keywords

Generative Adversarial Networks (GANs), Data Deduplication, Synthetic Data Generation, Scalability, Adversarial Learning.