find_big_files
Have you ever been low on disk space, and wanted to find the largest files, either on the whole machine, or below a certain directory? That's what this program does.
find_big_files should be put in the /usr/local/bin directory, and set to be executable with a "chmod 755 find_big_files" command. It can be renamed from "find_big_files.pl" to "find_big_files" so that it's easier to type.
If called without any arguments, it starts searching in the current working directory and every directory below, or you can pass a starting directory on the command line.
Examples:
find_big_files > /tmp/big_files_log
find_big_files / > /tmp/big_files_log
Note that in both examples, the output is saved in a log file, as it will be very large. The biggest files will be listed first, the smaller ones last.
find_big_files is a Perl script that does roughly the same thing as:
find . -exec ls -ld {} \; | grep -v ^d | sort -n -r --key=5
However, it is much faster, and if you run it on a large directory tree, it will print dots to the screen as a progress indicator so that you know it's doing something.
It is faster primarily because the pipeline above executes a separate "ls" for every file and directory encountered in the tree, whereas the script uses a stat() call, which is far less expensive time-wise than the fork() and exec() needed to run each "ls".
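The cost difference can also be seen without the Perl script. The sketch below is not the script's code. It is a shell stand-in that assumes GNU find, whose -printf action prints the size find has already obtained for each entry, so no "ls" process is started per file. The function name big_files is made up for this illustration.

```sh
# Hypothetical helper, assuming GNU find (-printf is a GNU extension).
# Lists every non-directory entry below the starting directory
# (default: the current directory) as "size<TAB>path", largest first.
# find prints the size itself, so no child process is started per entry.
# Paths containing newlines will confuse the line-based sort.
big_files() {
    find "${1:-.}" ! -type d -printf '%s\t%p\n' | sort -n -r -k 1,1
}
```

For example, "big_files /var > /tmp/big_files_log" writes the list to a log file, just as in the examples above. Running it and the "find ... -exec ls" pipeline under "time" on the same tree should show the per-file process overhead.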
When run on the root (/) directory of a 3GHz P4 with 2G of RAM running Ubuntu 7.10 and having approximately 250,000 files taking up about 128G, it takes about 2 1/2 minutes of real time to complete, and a very large (17M) log file is produced. This is about 14 times faster than the 35 minutes the pipeline of shell commands takes.
When run on the root directory of a 450MHz AMD K6 with 256M of RAM running CentOS 4 and having approximately 217,000 files taking up about 37G, it takes about 7 minutes, and a 15M log file is produced.
On a slower machine, there will be a very noticeable time between when the dots stop (it's no longer scanning the files) and when the output is printed. It's sorting by size during this time. I couldn't think of a way to print dots during sorting without noticeably slowing down the sort...
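The same pause can be reproduced with a small shell stand-in. This is only a sketch and not how the Perl script is written. It assumes the dots go to standard error (so they reach the screen while standard output is redirected to the log) and that one dot is printed per fixed number of entries. The name progress_dots and the default of 1000 are invented for the example.

```sh
# Hypothetical filter: copies its input to its output unchanged and
# prints one dot to standard error for every N lines (default 1000).
# A final newline is printed after the dots once the input ends.
progress_dots() {
    awk -v every="${1:-1000}" '
        { print }
        NR % every == 0 { printf "." > "/dev/stderr" }
        END { if (NR >= every) print "" > "/dev/stderr" }
    '
}
```

Placed between a scan and a sort, as in "find / ! -type d -printf '%s\t%p\n' | progress_dots | sort -n -r > /tmp/big_files_log" (GNU find assumed), the dots stop as soon as the scan is finished, and nothing more appears until sort has read everything and written its output.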
I have solved the problem with the display of files greater than 2G. The fix was trivial, as expected: I remembered how to do it in an instant when I thought about it a few days later, and all it took was changing one character.



