Over the last few days I checksummed all files on my whole system with a combination of find, parallel and xxhsum.
The results are now in one file per filesystem, in the format
<checksum> <absolute path to the checksummed file>
. The goal is to find duplicate files, subdirectories and whole directories.
At the moment I am able to identify single-file twins with a combination of sort and uniq on the
command line, but the results are kind of useless: how often will I find the GNU General Public License, for example...
The idea now is to checksum the checksums (their ASCII representation) of all files in a directory and use that as the checksum
of that directory. Then step up one level and build the checksum of all checksums of the contents of that level
as the checksum of that level, step one level up, and so forth...
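To make the idea concrete, here is a minimal sketch of that bottom-up scheme, assuming the per-file results have already been parsed into a dict; the function name `directory_checksums` and the use of `hashlib.md5` are my own choices, a placeholder for whatever hash ends up being used:

```python
import hashlib
import os
from collections import defaultdict

def directory_checksums(file_checksums):
    """Bottom-up directory checksums from a {file path: checksum} dict.

    A directory's checksum is the hash of the concatenated (ASCII)
    checksums of its direct children -- files and subdirectories --
    sorted by name for a stable result.
    """
    children = defaultdict(dict)              # dir -> {child name: checksum}
    for path, csum in file_checksums.items():
        children[os.path.dirname(path)][os.path.basename(path)] = csum

    # Register every ancestor directory, so directories that contain
    # only subdirectories (no direct files) also get a checksum.
    for d in list(children):
        while True:
            parent = os.path.dirname(d)
            if parent == d:                   # reached the root
                break
            children.setdefault(parent, {})
            d = parent

    result = {}
    # A child's path is always longer than its parent's, so processing
    # by decreasing path length handles children before their parents.
    for d in sorted(children, key=len, reverse=True):
        digest = hashlib.md5()                # placeholder; any hash works
        for name in sorted(children[d]):
            digest.update(children[d][name].encode("ascii"))
        result[d] = digest.hexdigest()
        parent = os.path.dirname(d)
        if parent != d:                       # propagate up to the parent
            children[parent][os.path.basename(d)] = result[d]
    return result
```

Inverting the returned dict (checksum -> list of directories) then exposes duplicate directory trees the same way sort/uniq exposes duplicate files.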
I want to implement that in Python.
Question: I need a *fast* and good CRC/hash tool/module/whatever which I can use directly from within Python
and which checksums strings (not only files).
Opening a subprocess for each checksum is not an option: Python is already slow, and there are a LOT of files.
What can I use for that purpose?
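For hashing strings in-process, the standard library already ships C-implemented options, so no subprocess is needed; a quick sketch (the third-party xxhash package is an assumption on my part -- it matches the xxhsum tool used for the files, but needs a pip install):

```python
import zlib
import hashlib

# A concatenated run of per-file checksum lines, as bytes
data = "aaa /some/file\nbbb /other/file".encode("ascii")

# zlib.crc32: stdlib, implemented in C, very fast, 32-bit result
crc = zlib.crc32(data)

# hashlib: also C-implemented; md5/sha1 are fast and have enough bits
# that accidental collisions are a non-issue for duplicate detection
md5 = hashlib.md5(data).hexdigest()

# To stay consistent with xxhsum, the third-party 'xxhash' package
# (pip install xxhash) exposes the same algorithm in-process:
#   import xxhash
#   xxh = xxhash.xxh64(data).hexdigest()

print(crc, md5)
```

Both stdlib calls cost microseconds per string, so the per-checksum overhead is negligible even for millions of files.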



