Python: Checksumming checksums

Post by northfrisia » Tue Mar 31, 2026 5:11 am

Hi,

Over the last few days I checksummed every file on my system with a combination of find, parallel and xxhsum.
The results are now in one file per filesystem, with lines in the format

<checksum> <absolute path to the checksummed file>

The goal is to find duplicated files, subdirectories and directories.

At the moment I am able to identify single-file twins with a combination of sort and uniq on the
command line, but the results are kind of useless: just think how often I will find the GNU General Public License, for example... ;)
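
For reference, the sort/uniq step can also be done in-process. The following is a minimal sketch, assuming each line has the form "<checksum> <absolute path>" separated by whitespace and that no path contains a newline; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def group_by_checksum(lines):
    """Map each checksum to the list of paths that share it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first run of whitespace only, so paths
        # that contain spaces stay in one piece.
        checksum, path = line.split(None, 1)
        groups[checksum].append(path)
    return groups


def duplicates(groups):
    """Keep only checksums that were seen for more than one path."""
    return {c: paths for c, paths in groups.items() if len(paths) > 1}


# Hypothetical sample data, not real xxhsum output.
sample = [
    "aaaa1111  /usr/share/doc/foo/COPYING",
    "aaaa1111  /usr/share/doc/bar/COPYING",
    "bbbb2222  /etc/hostname",
]
dups = duplicates(group_by_checksum(sample))
```

With a real list, the lines would come from iterating over the opened checksum file instead of the sample list.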

The idea now is to checksum the checksums (their ASCII representation) of all files in a directory and use the result
as the checksum of that directory. Then step up one level and build the checksum of all checksums of that level's contents
as the checksum of that level, step up another level, and so forth...
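
The bottom-up scheme described above could look roughly like this. It is only a sketch under several assumptions: the input is a dict of absolute POSIX file path to hex checksum, child checksums are sorted before hashing so that ordering does not matter, names are deliberately left out of the hash (so two directories with the same contents under different file names compare equal), and hashlib.blake2b from the standard library stands in for whatever hash is finally chosen. The names digest_text and directory_digests are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def digest_text(text):
    """Hash the ASCII representation of one or more checksums."""
    return hashlib.blake2b(text.encode("ascii"), digest_size=8).hexdigest()


def directory_digests(file_checksums):
    """Return a dict of directory path -> digest of its children's digests."""
    children = defaultdict(set)  # directory -> paths of its direct children
    for path in file_checksums:
        node = PurePosixPath(path)
        # Register the file and each ancestor directory with its parent.
        while node != node.parent:
            children[str(node.parent)].add(str(node))
            node = node.parent

    digests = dict(file_checksums)
    # Deepest directories first, so every child already has a digest.
    by_depth = sorted(children, key=lambda p: len(PurePosixPath(p).parts), reverse=True)
    for directory in by_depth:
        child_digests = sorted(digests[c] for c in children[directory])
        digests[directory] = digest_text("\n".join(child_digests))
    return {d: digests[d] for d in children}


# Hypothetical sample: /a/x and /b/y hold the same contents under other names.
sample = {
    "/a/x/f1": "1111", "/a/x/f2": "2222",
    "/b/y/g1": "2222", "/b/y/g2": "1111",
    "/c/z/h1": "3333",
}
result = directory_digests(sample)
```

Empty directories never appear in the checksum list, so they get no digest here. If renamed files should not count as identical, the child's name can be hashed together with its checksum.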

I want to implement that in Python.

Question: I need a *fast* and good CRC-building tool/module/whatever that I can use directly from within Python
and that checksums strings (not only files).
Opening a subprocess for each checksum is not an option. Python is already slow ... and there are a LOT of files.

What can I use for that purpose?
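
As a possible starting point, the Python standard library can already hash strings in-process: zlib.crc32 gives a 32-bit CRC and hashlib.blake2b gives a digest of configurable length. The sketch below uses only those two; the helper names are made up. The third-party xxhash package would be another candidate, but it is not assumed to be installed here.

```python
import hashlib
import zlib


def crc32_hex(text):
    """32-bit CRC of a string, formatted as 8 hex digits."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")


def blake2_hex(text, size=8):
    """BLAKE2b digest of a string; size is the digest length in bytes."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=size).hexdigest()


# Combine several (hypothetical) file checksums into one value,
# sorted first so the order of the inputs does not matter.
file_sums = ["2222", "1111", "3333"]
combined = blake2_hex("\n".join(sorted(file_sums)))
```

A 32-bit CRC is likely to produce accidental collisions once the list grows to tens of thousands of entries, so a 64-bit or longer digest is probably the safer choice for a whole system.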
Post by b11n » Tue Mar 31, 2026 6:22 am

northfrisia wrote: The goal is to find duplicated files, subdirectories and directories.

[...]

but the results are kind of useless: just think how often I will find the GNU General Public License, for example... ;)
Well, do you want to find duplicated files, or not? :wink:

If the goal is to find duplicated files larger than N bytes (you don't specify), then exclude the smaller ones from checksumming. You might then find that the time taken to fire off a subprocess for each file is not so bad. "A lot" is a relative term; ultimately you're going to be reading the entire content of the volume in question. I'd be surprised if raw I/O isn't the bottleneck, especially if you use multiple threads.
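
A size cut-off of that kind might be sketched like this, assuming a threshold in bytes and a walk over a directory tree; files_at_least is a made-up name, and the temporary directory exists only to make the example self-contained.

```python
import os
import stat
import tempfile


def files_at_least(root, min_size):
    """Yield regular files under root that are at least min_size bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size >= min_size:
                yield path


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in (("small.txt", 10), ("big.bin", 5000)):
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"\0" * size)
    kept = sorted(os.path.basename(p) for p in files_at_least(tmp, 4096))
```

Only the paths that pass the filter would then be handed to the checksumming step.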
Post by northfrisia » Tue Mar 31, 2026 7:50 am

...please reread my post.
Thanks
Post by NeddySeagoon » Tue Mar 31, 2026 9:26 am

northfrisia,
"The goal is to find duplicated files, subdirectories and directories."
That's not a goal. It's a step along the way. It will give you a list.

What do you want to do with the list once you have it?

e.g. *identical* files/directories will have *identical* names and timestamps, so why checksum everything?
If deduplication is the aim (you don't say), move to a filesystem that does deduplication.

Tell us about the problem you want to solve, not a part of your perceived solution.
Regards,

NeddySeagoon

Post by northfrisia » Tue Mar 31, 2026 9:35 am

I see.
Thanks for your help.
Post by Goverp » Tue Mar 31, 2026 10:16 am

There are several extant packages for identifying duplicate files and directories from checksums. But writing your own may scratch a singular itch, or be a learning experience.
See for example
app-misc/czkawka
app-misc/fdupes
app-misc/jdupes
app-misc/rdfind
app-arch/duff

(found with a Google search for "gentoo file duplicate finder packages")