Fast way to find duplicate data in MongoDB

I need to find the duplicate values among my 40 million records so that I can create a unique index on the name field.

> db.collection.aggregate([
...     { $group : {_id : "$field_name", total : { $sum : 1 } } },
...     { $match : { total : { $gte : 2 } } },
...     { $sort : {total : -1} },
...     { $limit : 5 }],
... { allowDiskUse: true}    
...     );

{ "_id" : "data001", "total" : 2 }
{ "_id" : "data004231", "total" : 2 }
{ "_id" : "data00751", "total" : 2 }
{ "_id" : "data0021", "total" : 2 }
{ "_id" : "data001543", "total" : 2 }
> 

{ allowDiskUse : true } is optional if your data set is small; it lets the aggregation stages write temporary files to disk when they exceed the memory limit, which is likely with this many records.

Increase the value in { $limit : 5 } to display more results.
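
Once the duplicates are identified, they have to be removed before the unique index can be built. Below is a minimal sketch of one way to do this in the mongo shell, assuming the collection is called collection and the duplicate field is field_name (adjust both to your own names). It pushes the _id of every duplicate document into the group, keeps the first one, and deletes the rest; afterwards the unique index can be created.

db.collection.aggregate([
    // collect the _ids of all documents sharing the same field_name value
    { $group : { _id : "$field_name", total : { $sum : 1 }, ids : { $push : "$_id" } } },
    { $match : { total : { $gte : 2 } } }
], { allowDiskUse : true }).forEach(function(doc) {
    // keep the first document, delete the remaining duplicates
    doc.ids.shift();
    db.collection.deleteMany({ _id : { $in : doc.ids } });
});

// once no duplicates remain, build the unique index
db.collection.createIndex({ field_name : 1 }, { unique : true });

Keeping the first _id and deleting the rest is an arbitrary choice; decide which document should survive before running the deleteMany.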
