Fast way to find duplicate data in MongoDB

I need to find the duplicate values among my 40 million records so that I can create a unique index on the name field.

> db.collection.aggregate([
...     { $group : {_id : "$field_name", total : { $sum : 1 } } },
...     { $match : { total : { $gte : 2 } } },
...     { $sort : {total : -1} },
...     { $limit : 5 }],
... { allowDiskUse: true}    
...     );

{ "_id" : "data001", "total" : 2 }
{ "_id" : "data004231", "total" : 2 }
{ "_id" : "data00751", "total" : 2 }
{ "_id" : "data0021", "total" : 2 }
{ "_id" : "data001543", "total" : 2 }
> 

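{ allowDiskUse: true } lets aggregation stages spill to temporary files on disk. It is only needed when a stage would exceed the 100 MB memory limit, so you can omit it if your data is not huge.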

{ $limit : 5 } caps the output at the five largest groups; raise the number to display more duplicates.
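
For a quick cross-check outside MongoDB, the same group, filter, sort, and limit steps can be sketched with standard Unix tools. This is only an illustration: it assumes the field values have already been exported to a plain text stream, one value per line (the export step is not shown), and top_duplicates is a hypothetical helper name, not a MongoDB tool.

```shell
# Hypothetical helper: reads one field value per line on stdin and prints
# "count<TAB>value" for every value seen at least twice, most frequent first.
# Usage: top_duplicates [limit]   (limit defaults to 5, like { $limit : 5 })
top_duplicates() {
  limit="${1:-5}"
  tab="$(printf '\t')"
  sort |                          # bring equal values together
    uniq -c |                     # count each value, like $group with $sum
    awk '$1 >= 2 { c = $1; sub(/^ *[0-9]+ /, ""); print c "\t" $0 }' |
    sort -t "$tab" -k1,1nr -k2 |  # highest count first, like $sort
    head -n "$limit"              # keep the first rows, like $limit
}

# Sample input: data001 appears three times, data003 twice, data002 once.
printf 'data001\ndata002\ndata001\ndata003\ndata003\ndata001\n' | top_duplicates 5
```

The awk step plays the role of the $match stage: it drops every value that occurs only once and normalizes the output of uniq -c into two tab-separated columns.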

How to Install MongoDB 3.2 on CentOS 7

vim /etc/yum.repos.d/mongodb.repo

Paste the following into the file, then save and exit with :wq


[MongoDB]
name=MongoDB Repository
baseurl=http://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.2/x86_64/
gpgcheck=0
enabled=1
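
If you prefer not to edit the file by hand, the same content can be written from the shell. This is a minimal sketch: write_mongodb_repo is a hypothetical helper name, and the repository definition is copied unchanged from the listing above.

```shell
# Hypothetical helper: writes the repository definition to the given path.
# The quoted 'EOF' delimiter stops the shell from expanding $releasever,
# which yum has to substitute itself.
write_mongodb_repo() {
  cat > "$1" <<'EOF'
[MongoDB]
name=MongoDB Repository
baseurl=http://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.2/x86_64/
gpgcheck=0
enabled=1
EOF
}

# Example (run as root):
# write_mongodb_repo /etc/yum.repos.d/mongodb.repo
```

With an unquoted EOF the shell would replace $releasever with an empty string and yum would end up with a broken baseurl.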

Download and install MongoDB using yum

yum install mongodb-org -y

Start mongod and configure it to start automatically at boot

/etc/init.d/mongod restart
chkconfig mongod on

Check the installed versions

# mongo --version
MongoDB shell version: 3.2.3
# mongod --version
db version v3.2.3
git version: b326ba837cf6f49d65c2f85e1b70f6f31ece7937
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64

Test the connection

# mongo
MongoDB shell version: 3.2.3
> use test
switched to db test
> db.test.save( { juzhax: 1 } )
WriteResult({ "nInserted" : 1 })
> db.test.find()
{ "_id" : ObjectId("56d4ac48b376b143e4749229"), "juzhax" : 1 }

WARNING: /sys/kernel/mm/transparent_hugepage/enabled is ‘always’.

After installing MongoDB 3.2.3 on CentOS 7, I got these warnings when starting the mongo shell.

# mongo
MongoDB shell version: 3.2.3
connecting to: test
Server has startup warnings:
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten]
2016-02-29T14:11:49.308-0500 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

Solution

Create the init.d script.
Create the following file at /etc/init.d/disable-transparent-hugepages:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          disable-transparent-hugepages
# Required-Start:    $local_fs
# Required-Stop:
# X-Start-Before:    mongod mongodb-mms-automation-agent
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Disable Linux transparent huge pages
# Description:       Disable Linux transparent huge pages, to improve
#                    database performance.
### END INIT INFO

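case $1 in
  start)
    # Locate the THP directory; some Red Hat kernels use a different name.
    if [ -d /sys/kernel/mm/transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/transparent_hugepage
    elif [ -d /sys/kernel/mm/redhat_transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/redhat_transparent_hugepage
    else
      # No THP directory on this kernel, so there is nothing to do. Use exit,
      # not return, because this runs as a script rather than a function.
      exit 0
    fi

    # Disable both THP itself and its defragmentation.
    echo 'never' > "${thp_path}/enabled"
    echo 'never' > "${thp_path}/defrag"

    unset thp_path
    ;;
esac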

Make it executable.
Run the following command to ensure that the init script can be used:

sudo chmod 755 /etc/init.d/disable-transparent-hugepages
sudo chkconfig --add disable-transparent-hugepages
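
To confirm the setting took effect, you can read the two files back. The kernel lists every choice and marks the active one with brackets, for example "always madvise [never]". The sketch below extracts the bracketed value; thp_active_value is a hypothetical helper name, and the loop assumes the standard (non-redhat) path.

```shell
# Hypothetical helper: prints the active value from a THP setting string.
# The active choice is the one in brackets, e.g. "always madvise [never]".
thp_active_value() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([a-z+]*\)].*/\1/p'
}

for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
  # Skip quietly on systems where the file is missing or unreadable.
  if [ -r "$f" ]; then
    echo "$f: $(thp_active_value "$(cat "$f")")"
  fi
done
```

The init script only changes the setting when it is invoked with start (for example at the next boot), so run the check after that; both lines should then report never.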

WARNING: Cannot detect if NUMA interleaving is enabled. Failed to probe “/sys/devices/system/node/node1”: Permission denied

# mongo
MongoDB shell version: 3.2.3
connecting to: test
Server has startup warnings:
2016-02-29T23:11:36.666+0700 I CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten] ** WARNING: Cannot detect if NUMA interleaving is enabled. Failed to probe "/sys/devices/system/node/node1": Permission denied
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten] Failed to probe "/sys/kernel/mm/transparent_hugepage": Permission denied
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 W CONTROL  [initandlisten] Failed to probe "/sys/kernel/mm/transparent_hugepage": Permission denied
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten]
2016-02-29T23:11:36.667+0700 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 262144 files. Number of processes should be at least 131072 : 0.5 times number of files.

Solution

I was running the custom OVH kernel, which denies mongod access to these /sys paths, so it cannot be used with MongoDB. To solve this I had to reinstall the distribution's original kernel; after that the warning was gone.

WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

I received this warning when starting the mongo shell on the following setup:

# mongod --version
db version v3.2.3
git version: b326ba837cf6f49d65c2f85e1b70f6f31ece7937
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64
# mongo --version
MongoDB shell version: 3.2.3
CentOS Linux release 7.2.1511 (Core)
WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.

Solution

vim /etc/security/limits.d/90-nproc.conf

Then add this line:

mongod     soft    nproc     64000

and reboot:

reboot
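
After the reboot you can check the rule from the warning yourself: the process limit should be at least half the open-file limit. The sketch below is a rough check under a few assumptions: nproc_limit_ok is a hypothetical helper name, ulimit -u is the bash spelling of the process limit, and the values shown are those of the current shell, so run it as the mongod user.

```shell
# Hypothetical helper: succeeds when the process limit satisfies the rule
# from the warning (processes >= 0.5 * open files).
# Usage: nproc_limit_ok <processes> <files>
nproc_limit_ok() {
  procs="$1"
  files="$2"
  # An unlimited process count always satisfies the rule.
  if [ "$procs" = "unlimited" ]; then return 0; fi
  # A finite process limit can never match an unlimited file limit.
  if [ "$files" = "unlimited" ]; then return 1; fi
  [ "$procs" -ge $((files / 2)) ]
}

# Soft limits of the current shell (bash): -u is processes, -n is open files.
if nproc_limit_ok "$(ulimit -u)" "$(ulimit -n)"; then
  echo "nproc limit is high enough"
else
  echo "nproc limit is too low"
fi
```

With the numbers from the warning above (4096 processes, 64000 files) the check fails, because 4096 is below 32000; with nproc raised to 64000 it passes.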

MongoDB Error: about to fork child process, waiting until server is ready for connections.

I installed MongoDB on CentOS 6.4 (64-bit) and hit an error when launching mongod like this:

numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf

Here is the error output:

about to fork child process, waiting until server is ready for connections.
forked process: 9713
Wed Oct 16 02:00:00.640 terminate() called, printing stack (if implemented for platform):
0xdddd81 0x6cfbae 0x35d60203be6 0x35d60203c13 0x35d60203d0e 0x35d601a8ce7 0x35d60201a04 0x35d601ad3bc 0x35d601ae226 0xdfb5df 0xdfbf2b 0xdf8bd0 0x9ed4df 0x6dde80 0x6dfc29 0x35d5f938cdd 0x6cf999
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo11myterminateEv+0x3e) [0x6cfbae]
/usr/lib64/libstdc++.so.6(+0xbcbe6) [0x35d60203be6]
/usr/lib64/libstdc++.so.6(+0xbcc13) [0x35d60203c13]
/usr/lib64/libstdc++.so.6(+0xbcd0e) [0x35d60203d0e]
/usr/lib64/libstdc++.so.6(_ZSt21__throw_runtime_errorPKc+0x67) [0x35d601a8ce7]
/usr/lib64/libstdc++.so.6(+0xbaa04) [0x35d60201a04]
/usr/lib64/libstdc++.so.6(_ZNSt6locale5_ImplC2EPKcm+0x4c) [0x35d601ad3bc]
/usr/lib64/libstdc++.so.6(_ZNSt6localeC2EPKc+0x5f6) [0x35d601ae226]
/usr/bin/mongod(_ZN5boost11filesystem34path21wchar_t_codecvt_facetEv+0x4f) [0xdfb5df]
/usr/bin/mongod(_ZNK5boost11filesystem34path14root_directoryEv+0xbb) [0xdfbf2b]
/usr/bin/mongod(_ZN5boost11filesystem38absoluteERKNS0_4pathES3_+0x40) [0xdf8bd0]
/usr/bin/mongod(_ZN5mongo27initializeServerGlobalStateEb+0x15f) [0x9ed4df]
/usr/bin/mongod() [0x6dde80]
/usr/bin/mongod(main+0x9) [0x6dfc29]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x35d5f938cdd]
/usr/bin/mongod() [0x6cf999]
Wed Oct 16 02:00:00.645 Got signal: 6 (Aborted).

Wed Oct 16 02:00:00.649 Backtrace:
0xdddd81 0x6d0d29 0x35d5f94c960 0x35d5f94c8e5 0x35d5f94e0c5 0x6cfbb3 0x35d60203be6 0x35d60203c13 0x35d60203d0e 0x35d601a8ce7 0x35d60201a04 0x35d601ad3bc 0x35d601ae226 0xdfb5df 0xdfbf2b 0xdf8bd0 0x9ed4df 0x6dde80 0x6dfc29 0x35d5f938cdd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0d29]
/lib64/libc.so.6(+0x32960) [0x35d5f94c960]
/lib64/libc.so.6(gsignal+0x35) [0x35d5f94c8e5]
/lib64/libc.so.6(abort+0x175) [0x35d5f94e0c5]
/usr/bin/mongod(_ZN5mongo11myterminateEv+0x43) [0x6cfbb3]
/usr/lib64/libstdc++.so.6(+0xbcbe6) [0x35d60203be6]
/usr/lib64/libstdc++.so.6(+0xbcc13) [0x35d60203c13]
/usr/lib64/libstdc++.so.6(+0xbcd0e) [0x35d60203d0e]
/usr/lib64/libstdc++.so.6(_ZSt21__throw_runtime_errorPKc+0x67) [0x35d601a8ce7]
/usr/lib64/libstdc++.so.6(+0xbaa04) [0x35d60201a04]
/usr/lib64/libstdc++.so.6(_ZNSt6locale5_ImplC2EPKcm+0x4c) [0x35d601ad3bc]
/usr/lib64/libstdc++.so.6(_ZNSt6localeC2EPKc+0x5f6) [0x35d601ae226]
/usr/bin/mongod(_ZN5boost11filesystem34path21wchar_t_codecvt_facetEv+0x4f) [0xdfb5df]
/usr/bin/mongod(_ZNK5boost11filesystem34path14root_directoryEv+0xbb) [0xdfbf2b]
/usr/bin/mongod(_ZN5boost11filesystem38absoluteERKNS0_4pathES3_+0x40) [0xdf8bd0]
/usr/bin/mongod(_ZN5mongo27initializeServerGlobalStateEb+0x15f) [0x9ed4df]
/usr/bin/mongod() [0x6dde80]
/usr/bin/mongod(main+0x9) [0x6dfc29]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x35d5f938cdd]

ERROR: child process failed, exited with error number 14

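Solution

The stack trace shows the exception being thrown from the std::locale constructor, which means the locale settings in the environment were missing or invalid. I solved the problem by exporting the locale variables before starting mongod: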

# export LANGUAGE=en_US.UTF-8
# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
# numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf
about to fork child process, waiting until server is ready for connections.
forked process: 9821
all output going to: /var/log/mongo/mongod.log
child process started successfully, parent exiting
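
Exporting the variables only helps if the named locale is actually installed, so it can be worth checking that first. This is a rough sketch: locale_listed is a hypothetical helper name, and it assumes locale -a prints one locale per line, typically spelling the codeset as utf8 rather than UTF-8.

```shell
# Hypothetical helper: reads a locale list (as printed by `locale -a`) on
# stdin and succeeds if the requested locale is in it. Names are compared
# without case or dashes, so en_US.UTF-8 matches en_US.utf8.
locale_listed() {
  want="$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')"
  tr '[:upper:]' '[:lower:]' | sed 's/-//g' | grep -qxF "$want"
}

# Check the locale named in LANG, falling back to the one used above.
if locale -a 2>/dev/null | locale_listed "${LANG:-en_US.UTF-8}"; then
  echo "locale is installed"
else
  echo "locale is missing"
fi
```

To make the fix permanent, put the three export lines wherever the environment of the user that starts mongod is configured.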