Fast way to find duplicate data in MongoDB

I need to find out the duplicate data content in my 40 Millions records, then I can make the unique index to my name field.

> db.collecton.aggregate([
...     { $group : {_id : "$field_name", total : { $sum : 1 } } },
...     { $match : { total : { $gte : 2 } } },
...     { $sort : {total : -1} },
...     { $limit : 5 }],
... { allowDiskUse: true}    
...     );

{ "_id" : "data001", "total" : 2 }
{ "_id" : "data004231", "total" : 2 }
{ "_id" : "data00751", "total" : 2 }
{ "_id" : "data0021", "total" : 2 }
{ "_id" : "data001543", "total" : 2 }
> 

{ allowDiskUse: true} is optional if your data is not huge.

{ $limit : 5 }, you can set display more data.

How to Install MongoDB 3.2 on CentOS 7

vim /etc/yum.repos.d/mongodb.repo

Paste this to the file and save using :wq


[MongoDB]
name=MongoDB Repository
baseurl=http://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.2/x86_64/
gpgcheck=0
enabled=1

Download and install mongodb using yum

yum install mongodb-org -y

Start mongod and configure auto start while system boot

/etc/init.d/mongod restart
chkconfig mongod on

Check all the versions

[[email protected] ~]# mongo --version
MongoDB shell version: 3.2.3
[[email protected] ~]# mongod --version
db version v3.2.3
git version: b326ba837cf6f49d65c2f85e1b70f6f31ece7937
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64

Test the connection

[[email protected] ~]# mongo
MongoDB shell version: 3.2.3
> use test
switched to db test
> db.test.save( { juzhax: 1 } )
WriteResult({ "nInserted" : 1 })
> db.test.find()
{ "_id" : ObjectId("56d4ac48b376b143e4749229"), "juzhax" : 1 }

WARNING: /sys/kernel/mm/transparent_hugepage/enabled is ‘always’.

After I install MongoDB 3.2.3 in Centos 7, I received this error when I start mongo in shell.

[[email protected] ~]# mongo
MongoDB shell version: 3.2.3
connecting to: test
Server has startup warnings:
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

Solution

Create the init.d script.
Create the following file at /etc/init.d/disable-transparent-hugepages:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          disable-transparent-hugepages
# Required-Start:    $local_fs
# Required-Stop:
# X-Start-Before:    mongod mongodb-mms-automation-agent
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Disable Linux transparent huge pages
# Description:       Disable Linux transparent huge pages, to improve
#                    database performance.
### END INIT INFO

case $1 in
  start)
    if [ -d /sys/kernel/mm/transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/transparent_hugepage
    elif [ -d /sys/kernel/mm/redhat_transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/redhat_transparent_hugepage
    else
      return 0
    fi

    echo 'never' > ${thp_path}/enabled
    echo 'never' > ${thp_path}/defrag

    unset thp_path
    ;;
esac

Make it executable.
Run the following command to ensure that the init script can be used:

sudo chmod 755 /etc/init.d/disable-transparent-hugepages
sudo chkconfig --add disable-transparent-hugepages

WARNING: Cannot detect if NUMA interleaving is enabled. Failed to probe “/sys/devices/system/node/node1”: Permission denied

[[email protected] ~]# mongo
MongoDB shell version: 3.2.3
connecting to: test
Server has startup warnings:
2016-02-29T23:11:36.666+0700 I CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten] ** WARNING: Cannot detect if NUMA interleaving is enabled. Failed to probe "/sys/devices/system/node/node1": Permission denied
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten] Failed to probe "/sys/kernel/mm/transparent_hugepage": Permission denied
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten] Failed to probe "/sys/kernel/mm/transparent_hugepage": Permission denied
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 262144 files. Number of processes should be at least 131072 : 0.5 times number of files.

Solution

I’m using the OVH kernel, so it is impossible to use with MongoDB, to solve this issue I have to install back the original kernel of the linux, then this error will be gone.

WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

I’ve received this error while starting mongo in shell while installing on

[[email protected] ~]# mongod --version
db version v3.2.3
git version: b326ba837cf6f49d65c2f85e1b70f6f31ece7937
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64
[[email protected] ~]# mongo --version
MongoDB shell version: 3.2.3
CentOS Linux release 7.2.1511 (Core)
WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

Solution

vim /etc/security/limits.d/90-nproc.conf

Then put in

mongod     soft    nproc     64000

and

reboot