1. JESSY CHRISTADOSS - Senior Quality Engineer, Information Technology, Integral Ad Science, Texas, USA.
2. CHIRANJEEVI DEVI - Engineering Manager, Data Platform, Grammarly, Fremont, USA.
3. SRIKANTH GORLE - Senior Manager, Platform Engineering, CVS Health, Chicago, USA.
4. SWETHA RAVIPUDI - Engineering Manager, DevOps, Lucid Motors, Newark, USA.
5. TANUJ MATHUR - Business Development Director, Information Technology, Independent Scholar, Hempstead, USA.
This study explores the application of Generative Adversarial Networks (GANs) to data deduplication in large datasets. Traditional deduplication methods, which often rely on hashing and exact matching, scale poorly and lose effectiveness when they encounter near-duplicates. We propose a novel GAN-based framework that recasts deduplication as an adversarial learning problem in which a generator creates synthetic duplicates and a discriminator learns to identify them. By transforming deduplication into a predictive modeling task, the approach achieves improved detection accuracy and reduces the reported time complexity from O(n) to approximately O(log n). Empirical evaluations on synthetic large-scale datasets demonstrate significant gains in precision, recall, and runtime efficiency compared to conventional methods. The approach offers a scalable, intelligent alternative for managing redundant data in modern storage systems.
Generative Adversarial Networks (GANs), Data Deduplication, Synthetic Data Generation, Scalability, Adversarial Learning.
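
The limitation of exact matching noted in the abstract can be illustrated with a small sketch. It is not the proposed GAN framework: it only contrasts a hash-based exact-match key with a simple character-shingle Jaccard similarity, which stands in here for any learned near-duplicate score. The record strings, the function names (`exact_key`, `shingles`, `jaccard`), and the shingle size `k = 3` are assumptions made for illustration and do not come from the study.

```python
import hashlib


def exact_key(record: str) -> str:
    """Key used by exact-match deduplication: identical bytes give identical keys."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def shingles(record: str, k: int = 3) -> set:
    """Set of lowercase character k-grams, after collapsing runs of whitespace."""
    text = " ".join(record.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two records' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical records: the second is a near-duplicate of the first.
original = "Jane Doe, 42 Main Street, Springfield"
near_dup = "Jane Doe, 42 Main St., Springfield"
unrelated = "Robert Roe, 9 Elm Avenue, Shelbyville"

# Exact matching treats the near-duplicate as a distinct record.
same_hash = exact_key(original) == exact_key(near_dup)

# A similarity score separates the near-duplicate from the unrelated record.
sim_near = jaccard(original, near_dup)
sim_other = jaccard(original, unrelated)

print("hashes equal:", same_hash)
print("similarity to near-duplicate:", round(sim_near, 2))
print("similarity to unrelated record:", round(sim_other, 2))
```

In this sketch the two hashes differ even though the records describe the same entity, while the similarity score ranks the near-duplicate well above the unrelated record. A learned discriminator, as proposed in the study, would replace the fixed Jaccard score with a model trained on synthetic duplicates.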