Database Administrators and System Administrators have this in common: managing a large number of log files is just part of the job on Linux systems.
Tools such as logrotate significantly simplify the file management task for routinely created log files. Even so, there are still many ‘opportunities’ to exercise your command line fu to manage thousands or millions of files. These may be files that need to be moved, removed or searched.
When the files span multiple directories, the find command is often used. The following command, for instance, will find all log files over a certain size and remove them.
find . -name "*.log" -size +1M -exec rm {} \;
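If age matters as well, a -mtime test can be added alongside the size test. A quick sketch, using an arbitrary 30-day threshold:

# remove log files over 1MB that have not been modified in 30 days
find . -name "*.log" -size +1M -mtime +30 -exec rm {} \;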
For a few files this will work just fine, but what happens when the number of files to be processed runs into the thousands, or even millions?
The xargs Difference
Let’s first create 200k files to use for testing. These files will all be empty; there is no need for any content for these tests.
The script create.sh can be used to create the directories and empty files.
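The script itself is not shown in the article; here is a minimal sketch of what create.sh might look like. Only the 200k total and the file_ prefix are given, so the 200-directory layout and the dir_NNN names are assumptions:

#!/usr/bin/env bash
# create.sh - a minimal sketch, not the author's original script.
# Creates 200,000 empty files matching the file_* pattern used in
# the tests; the layout of 200 directories x 1,000 files is a guess.
for d in $(seq -w 1 200); do
   mkdir -p "dir_${d}"
   # build 1,000 file names and pass them to touch in batches
   ( cd "dir_${d}" && seq -w 1 1000 | sed 's/^/file_/' | xargs touch )
done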
As it takes some time to create the files, we will not be testing with the rm command here (the files would have to be recreated after every test run), but rather with the file command. The commands will be timed as well.
# time find . -type f -name file_\* -exec file {} \; >/dev/null

real    1m24.764s
user    0m4.624s
sys     0m12.581s
Perhaps 1 minute and 24 seconds seems to be a reasonable amount of time to process so many files.
It isn’t.
Let’s use a slightly different method to process these files, this time by adding xargs in a command pipe.
time find . -type f -name file_\* | xargs file >/dev/null

real    0m0.860s
user    0m0.456s
sys     0m0.432s
Wait, what?! 0.8 seconds? Can that be correct?
Yes, it is correct. Using xargs with find can greatly reduce the resources needed to iterate through files.
How, then, is it possible for the command that used xargs to complete so much faster than the command that did not?
When iterating through a list of files with the -exec argument, the find command forks a new process and executes the specified command once for each file found.
For a large number of files this requires a lot of resources.
As noted earlier, the file command is standing in for rm in these demonstrations.
Could it be that the xargs method benefited from the caching effects of running the first find command?
Could be – let’s run find … -exec again and see if it benefits from caching.
# time find . -type f -name file_\* -exec file {} \; >/dev/null

real    1m25.722s
user    0m3.900s
sys     0m11.893s
Clearly any caching didn’t help find … -exec.
Why Is xargs Fast?
Why is the use of xargs so much faster than find alone? In short, it is because find starts a new process for each file it finds when the -exec option is used.
The command ‘find | xargs’ was wrapped in a shell script find-xargs.sh to facilitate the use of strace.
The find-xargs.sh script takes 2 arguments: the number of files to pipe to xargs, and the number of files that xargs should send to the file command on each invocation of file.
The number of files to process is controlled by piping the output of find to head.
The xargs --max-args argument is used to control how many arguments are sent to each invocation of file.
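The script itself is not reproduced in the article; the following is a minimal sketch of what find-xargs.sh might look like, based on the description above (the file_* pattern and the output redirection are assumptions):

#!/usr/bin/env bash
# find-xargs.sh - a minimal sketch, not necessarily the author's original.
# $1 - number of files to pipe to xargs (limited via head)
# $2 - number of filenames xargs passes to each invocation of file
# Note: 'file' is deliberately left unqualified, so it is located by
# searching PATH - this becomes relevant later in the article.
MAX_FILES=$1
MAX_ARGS=$2

echo "MAX_FILES: $MAX_FILES"
echo "MAX_ARGS: $MAX_ARGS"

find . -type f -name 'file_*' \
   | head -n "$MAX_FILES" \
   | xargs --max-args="$MAX_ARGS" file >/dev/null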
We can now use strace with the -c option; -c accumulates a count of all calls along with timing information.
Calling the script to run for the first 10000 files, with 1000 files sent to each invocation of file:
# strace -c -f ./find-xargs.sh 10000 1000
MAX_FILES: 10000
MAX_ARGS: 1000
Process 11268 attached
Process 11269 attached
...
Process 11267 resumed
Process 11269 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.55    0.080017        5001        16         2 wait4
  0.35    0.000280           0     12372           newfstatat
  0.09    0.000074           0       208           getdents
  0.01    0.000006           0     10000           lstat
  0.00    0.000000           0       199           read
  0.00    0.000000           0       276         1 write
  0.00    0.000000           0       384        91 open
  0.00    0.000000           0       313         4 close
  0.00    0.000000           0        68        42 stat
  0.00    0.000000           0       189           fstat
  0.00    0.000000           0         5         1 lseek
  0.00    0.000000           0       209           mmap
  0.00    0.000000           0        71           mprotect
  0.00    0.000000           0        37           munmap
  0.00    0.000000           0        72           brk
  0.00    0.000000           0        41           rt_sigaction
  0.00    0.000000           0        80           rt_sigprocmask
  0.00    0.000000           0         2           rt_sigreturn
  0.00    0.000000           0        13        12 ioctl
  0.00    0.000000           0        77        77 access
  0.00    0.000000           0         2           pipe
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0        14           clone
  0.00    0.000000           0        14           execve
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4         1 fcntl
  0.00    0.000000           0       206           fchdir
  0.00    0.000000           0         5           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0        14           arch_prctl
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.080377                 24910       232 total
The largest chunk of time was spent in the wait4 system call; these are waits on the child processes started via execve, of which there were 14.
Of the 14 calls to execve, there was 1 each for the use of bash (the script itself), find, head and xargs, leaving 10 calls to be consumed by file.
The following command can be used if you would like to try this yourself:
strace -f -e trace=execve ./find-xargs.sh 10000 1000 2>&1 | grep execve
What happens when the same type of test is run against find with the -exec argument?
There is no method (that I can find in the man page anyway) by which we can limit the number of files that are sent to the program specified in the -exec argument of find.
We can still learn what is going on; it is just necessary to wait about 1.5 minutes for the command to complete.
# strace -c -f find . -type f -name file_\* -exec file {} \; >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.80    4.101094          21    200000           wait4
  0.69    0.029305           0    200000           clone
  0.46    0.019278           0   2602351   1400007 open
  0.44    0.018833           0    600001           munmap
  0.31    0.013108           0   3200017           mmap
  0.30    0.012715           0   1401173           fstat
  0.16    0.006979           0   1200006   1200006 access
  0.15    0.006543           0   1202345           close
  0.15    0.006288           0   1000004    600003 stat
  0.13    0.005632           0   1000004           read
  0.12    0.004981           0    200000           lstat
  0.09    0.003704           0    600026           brk
  0.07    0.003016           0   1000009           mprotect
  0.07    0.002776           0    200001    200000 ioctl
  0.03    0.001079           0    201169           newfstatat
  0.02    0.000806           0      2347           getdents
  0.01    0.000600           0    200000           write
  0.00    0.000003           0    200001           arch_prctl
  0.00    0.000002           0    202341           fchdir
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0    400001    200000 execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    4.236742            15811808   3600017 total
You may have noticed there are twice as many calls to execve as there were files to process.
This is due to something referenced in the comments of find-xargs.sh. Unless a full path name is specified when running a command, the directories in the PATH variable are searched for that command. If the command is not found by the first invocation of execve, another attempt is made in the next directory in PATH.
The following example shows the difference between using the command name alone and using the full path to the file command.
# strace -e trace=execve -f find -maxdepth 1 -type f -name \*.sh -exec file {} \; 2>&1 | grep execve
execve("/usr/bin/find", ["find", "-maxdepth", "1", "-type", "f", "-name", "*.sh", "-exec", "file", "{}", ";"], [/* 83 vars */]) = 0
[pid  9267] execve("/usr/local/bin/file", ["file", "./find-xargs.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9267] execve("/usr/bin/file", ["file", "./find-xargs.sh"], [/* 83 vars */]) = 0
[pid  9268] execve("/usr/local/bin/file", ["file", "./create.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9268] execve("/usr/bin/file", ["file", "./create.sh"], [/* 83 vars */]) = 0
[pid  9269] execve("/usr/local/bin/file", ["file", "./distribution.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9269] execve("/usr/bin/file", ["file", "./distribution.sh"], [/* 83 vars */]) = 0

# strace -e trace=execve -f find -maxdepth 1 -type f -name \*.sh -exec /usr/bin/file {} \; 2>&1 | grep execve
execve("/usr/bin/find", ["find", "-maxdepth", "1", "-type", "f", "-name", "*.sh", "-exec", "/usr/bin/file", "{}", ";"], [/* 83 vars */]) = 0
[pid  9273] execve("/usr/bin/file", ["/usr/bin/file", "./find-xargs.sh"], [/* 83 vars */]) = 0
[pid  9274] execve("/usr/bin/file", ["/usr/bin/file", "./create.sh"], [/* 83 vars */]) = 0
[pid  9275] execve("/usr/bin/file", ["/usr/bin/file", "./distribution.sh"], [/* 83 vars */]) = 0
Too Much Space
Regardless of how bad a practice it may be, there will be times when file and directory names contain space characters. Literal spaces, newlines and tabs can all play havoc with file name processing; xargs has you covered.
Two files are created to demonstrate:
# touch 'this filename has spaces' this-filename-has-no-spaces
# ls -l
total 0
-rw-r--r-- 1 jkstill dba 0 Apr 15 09:28 this filename has spaces
-rw-r--r-- 1 jkstill dba 0 Apr 15 09:28 this-filename-has-no-spaces
What happens when the output of find is piped to xargs?
find . -type f | xargs file
./this-filename-has-no-spaces: empty
./this:     ERROR: cannot open `./this' (No such file or directory)
filename:   ERROR: cannot open `filename' (No such file or directory)
has:        ERROR: cannot open `has' (No such file or directory)
spaces:     ERROR: cannot open `spaces' (No such file or directory)
The spaces in one of the filenames cause xargs to treat each word in the filename as a separate file.
Because of this, it is a good idea to use the -print0 and -0 arguments, as seen in the following example. These change the output terminator of find and the input terminator of xargs to the null character, so that whitespace in file and directory names is passed through intact.
find . -type f -print0 | xargs -0 file
./this-filename-has-no-spaces: empty
./this filename has spaces:    empty
There is quite a bit more to xargs than this; I would encourage you to read the man page and experiment with the options to better learn how to make use of it.
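One example, assuming GNU xargs: the -P option runs multiple invocations of the command in parallel, something find -exec cannot do at all.

# run up to 4 copies of file at a time, 1000 file names per invocation
find . -type f -name file_\* -print0 | xargs -0 -P 4 -n 1000 file >/dev/null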
Hope For find
For many versions of GNU find, there is an easy modification that can be made to the command line that will cause the -exec option to emulate the method xargs uses to pass input to a command.
Simply change -exec command {} \; to -exec command {} +, and the find command will execute much faster than before.
Here the find command has matched the performance of xargs when processing 200k files:
# time find . -type f -name file_\* -exec file {} + | wc
 200000  400000 8069198

real    0m0.801s
user    0m0.436s
sys     0m0.404s
This may mean a quick and simple change to maintenance scripts can yield a very large increase in performance.
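For example, the log cleanup command from the beginning of the article needs only its terminating character changed:

# before: one rm process forked per file
find . -name "*.log" -size +1M -exec rm {} \;
# after: rm is invoked with large batches of file names
find . -name "*.log" -size +1M -exec rm {} +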
Does this mean there is no longer a need for xargs? Not really, as xargs offers levels of control over the input to piped commands that simply are not available in the find command.
If you’ve never used xargs, you should consider doing so, as it can reduce the resource usage on your systems and decrease the runtime for maintenance tasks.