My intent was to flag potentially duplicate files by identifying files of the same size. For those files, I would then perform a full hash or a quick/partial hash, depending on the file size.
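To make the hashing step concrete, here is a minimal Python sketch of the size-dependent choice. The post does not state the cutoff, the sample size, or the hash algorithm, so `FULL_HASH_LIMIT`, `SAMPLE_SIZE`, `file_digest`, and the use of SHA-256 are all assumptions for illustration, not the actual code.

```python
import hashlib
import os

# Hypothetical values: the post does not say where the full/partial cutoff lies.
FULL_HASH_LIMIT = 1024 * 1024   # files up to 1 MiB are hashed in full
SAMPLE_SIZE = 64 * 1024         # bytes read from each end of a larger file


def file_digest(path, full_limit=FULL_HASH_LIMIT, sample=SAMPLE_SIZE):
    """Return (kind, hexdigest), where kind is 'full' or 'partial'."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if size <= full_limit:
            # Small file: hash every byte, reading in chunks to bound memory use.
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
            return "full", h.hexdigest()
        # Large file: hash only the first and last `sample` bytes.
        h.update(f.read(sample))
        f.seek(max(size - sample, 0))
        h.update(f.read(sample))
    return "partial", h.hexdigest()
```

A matching partial digest only means two files are still candidates; confirming a duplicate would need a full hash or a byte-by-byte comparison.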
My problem is with the intermediate step: after identifying the potentially duplicate files (I have a query which does that) but before performing the hash (I have code ready to do that).
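For readers following along, a size-based query of this kind might look like the sketch below. It is not the actual query, which is not shown in the post; the table name `db` and the column `st_size` come from the schema given later, while the function name `duplicate_size_groups` is made up for illustration.

```python
import sqlite3


def duplicate_size_groups(conn):
    """Return (st_size, file_count) for every size shared by 2 or more files."""
    sql = """
        SELECT st_size, COUNT(*) AS n
        FROM db
        WHERE st_size IS NOT NULL
        GROUP BY st_size
        HAVING COUNT(*) > 1
        ORDER BY st_size
    """
    return conn.execute(sql).fetchall()
```

Each returned row is one candidate size together with the number of files that share it.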
So I perform the query for potential duplicates, and these are the high-level results:
Code:
Db Total Records: 1_703_795
Total Duplicates: 1_664_774 (about 40k files have a unique file size)
Duplicate sizes : 69_845 (different file sizes with 2 or more files sharing that size)

My thought was to create a separate table as an index of the potential duplicates. However, I don't think this will work given the number of foreign keys that would need to reference each record:
Code:
Duplicate Files Table (ID: Primary Key)
ID      Size      # Foreign Keys
1       0         54641
2       2         19256
3       7         11480
4       20        24529
5       4096      15055
6       100000    2
.       ...       2
26150   9998901   2
.       10036     10
.       ...       10
.       98955     10

First, I'm not sure it helps. I'd still have to query the primary database and correlate the results with the Duplicate Files Table.
Second, I don't want to change the database as a first option. I know I'll be making other changes, so I'd like to delay until I've identified some other issues more concretely. For example, I expect I'll need to do something with the full path to a file to improve search granularity, but I have no idea how to do that yet. Making changes to the schema would require making code changes, and I'd rather not do that too many times.
I've not been able to come up with any other ideas, so I'm wondering if I'm unable to see past the solution I've already thought of, or if there just aren't many alternatives to the problem.
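One direction that would not touch the schema, sketched here only to make the intermediate step concrete: read the rows ordered by size and batch adjacent rows that share a size. This is an untested idea against the real 1.7M-row database, not a recommendation; the function name `iter_same_size_groups` is hypothetical, and the columns used (`id`, `st_size`, `full_pathname`) are taken from the schema below.

```python
import sqlite3
from itertools import groupby


def iter_same_size_groups(conn):
    """Yield (st_size, [(id, full_pathname), ...]) for sizes shared by 2+ files."""
    sql = """
        SELECT st_size, id, full_pathname
        FROM db
        WHERE st_size IS NOT NULL
        ORDER BY st_size, id
    """
    cursor = conn.execute(sql)
    # Rows arrive sorted by size, so equal sizes are adjacent and groupby
    # can batch them one size at a time instead of loading every row at once.
    for size, rows in groupby(cursor, key=lambda row: row[0]):
        files = [(row[1], row[2]) for row in rows]
        if len(files) > 1:
            yield size, files
```

Each yielded group could then be handed to the hashing code. Without an index on `st_size`, SQLite has to sort the whole table for the `ORDER BY`, so the first group may take a while to appear.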
Here is the schema for the primary database, although I don't think it particularly matters. I'm not sure how many of the st_* fields I'll need (stat_errno and stat_error need to change or go away).
Code:
@sqlite> select sql from sqlite_master;
CREATE TABLE db (
    id INTEGER NOT NULL PRIMARY KEY,
    st_mode INTEGER, st_ino INTEGER, st_dev INTEGER,
    st_nlink INTEGER, st_uid INTEGER, st_gid INTEGER,
    st_size INTEGER, st_atime INTEGER, st_mtime INTEGER,
    st_ctime INTEGER, stat_errno INTEGER, stat_error TEXT,
    filetype TEXT, filename TEXT, path TEXT,
    full_pathname TEXT
)


