Over the last few days I checksummed all files on my whole system with a combination of find, parallel and xxhsum.
The results are now in one file per filesystem, in the format
<checksum> <absolute path to the checksummed file>
. The goal is to find duplicate files, subdirectories and whole directories.
At the moment I am able to identify single-file twins with a combination of sort and uniq on the
command line, but the results are kind of useless: how often will I find the GNU General Public License, for example...
The idea now is to checksum the checksums (their ASCII representation) of all files in a directory and use that as the checksum
of that directory. Then step up one level and build the checksum of all checksums of the contents of that level
as the checksum of that level, step one level up, and so forth...
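To make the idea concrete, here is a minimal sketch of that bottom-up scheme, assuming the per-file results have already been parsed into a dict; the function name `directory_checksums` and the use of `hashlib.md5` are my own choices, a placeholder for whatever hash ends up being used:

```python
import hashlib
import os
from collections import defaultdict

def directory_checksums(file_checksums):
    """Bottom-up directory checksums from a {file path: checksum} dict.

    A directory's checksum is the hash of the concatenated (ASCII)
    checksums of its direct children -- files and subdirectories --
    sorted by name for a stable result.
    """
    children = defaultdict(dict)              # dir -> {child name: checksum}
    for path, csum in file_checksums.items():
        children[os.path.dirname(path)][os.path.basename(path)] = csum

    # Register every ancestor directory, so directories that contain
    # only subdirectories (no direct files) also get a checksum.
    for d in list(children):
        while True:
            parent = os.path.dirname(d)
            if parent == d:                   # reached the root
                break
            children.setdefault(parent, {})
            d = parent

    result = {}
    # A child's path is always longer than its parent's, so processing
    # by decreasing path length handles children before their parents.
    for d in sorted(children, key=len, reverse=True):
        digest = hashlib.md5()                # placeholder; any hash works
        for name in sorted(children[d]):
            digest.update(children[d][name].encode("ascii"))
        result[d] = digest.hexdigest()
        parent = os.path.dirname(d)
        if parent != d:                       # propagate up to the parent
            children[parent][os.path.basename(d)] = result[d]
    return result
```

Inverting the returned dict (checksum -> list of directories) then exposes duplicate directory trees the same way sort/uniq exposes duplicate files.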
I want to implement that in Python.
Question: I need a *fast* and good CRC/hash tool/module/whatever which I can use directly from within Python
and which checksums strings (not only files).
Opening a subprocess for each checksum is not an option: Python is already slow, and there are a LOT of files.
What can I use for that purpose?
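For hashing strings in-process, the standard library already ships C-implemented options, so no subprocess is needed; a quick sketch (the third-party xxhash package is an assumption on my part -- it matches the xxhsum tool used for the files, but needs a pip install):

```python
import zlib
import hashlib

# A concatenated run of per-file checksum lines, as bytes
data = "aaa /some/file\nbbb /other/file".encode("ascii")

# zlib.crc32: stdlib, implemented in C, very fast, 32-bit result
crc = zlib.crc32(data)

# hashlib: also C-implemented; md5/sha1 are fast and have enough bits
# that accidental collisions are a non-issue for duplicate detection
md5 = hashlib.md5(data).hexdigest()

# To stay consistent with xxhsum, the third-party 'xxhash' package
# (pip install xxhash) exposes the same algorithm in-process:
#   import xxhash
#   xxh = xxhash.xxh64(data).hexdigest()

print(crc, md5)
```

Both stdlib calls cost microseconds per string, so the per-checksum overhead is negligible even for millions of files.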



