[SOLVED] SQL create an index or something else?

Problems with emerge or ebuilds? Have a basic programming question about C, PHP, Perl, BASH or something else?
21 posts • Page 1 of 1

pjp
Administrator
Posts: 20668
Joined: Tue Apr 16, 2002 10:35 pm

[SOLVED] SQL create an index or something else?

Post by pjp » Wed May 10, 2023 9:31 pm

I have an SQLite database that contains file metadata (st_* such as size, time) and a couple of other items. The point of the database is to facilitate deletion of files I don't need.

My intent was to flag potentially duplicate files by identifying files of the same size. For those files, I would then perform full hash or quick / partial hash, depending on the file size.

My problem is with the intermediate step: after identifying the potentially duplicate files (I have a query which does that) and before performing the hash (I have code ready to do that).

So I perform the query for potential duplicates, and these are the high level results:

Code:

Db Total Records:   1_703_795
Total Duplicates:   1_664_774   (about 40k files have a unique file size)
Duplicate sizes :      69_845   (different file sizes with 2 or more files sharing that size)
I was surprised by how many file sizes were shared with one or more other files.

My thought was to create a separate index to track the potential duplicates. However, I don't think this will work given the quantity of foreign keys that would need to reference each record:

Code:

Duplicate Files Table (ID: Primary Key)

ID      Size        # Foreign Keys
1       0           54641
2       2           19256
3       7           11480
4       20          24529
5       4096        15055
6       100000      2   
.       ...         2   
26150   9998901     2   
.       10036       10  
.       ...         10  
.       98955       10
The immediately obvious possible solution there would be to modify the primary index and have those records reference a record in this Duplicate Files Table that corresponds with the appropriate file size. I have two problems with this.

First, I'm not sure it helps. I'd still have to query the primary database and make a correlation with the Duplicate Files Table.

Second, I don't want to change the database as a first option. I know I'll be making other changes, so I'd like to delay until I've identified some other issues more concretely. For example, I expect I'll need to do something with the full path to a file to improve search granularity, but I have no idea how to do that yet. Making changes to the schema would require making code changes, and I'd rather not do that too many times.

I've not been able to come up with any other ideas, so I'm wondering if I'm unable to see past the solution I've already thought of, or if there just aren't many alternatives to the problem.

Here is the schema for the primary database, although I don't think it particularly matters. I'm not sure how many of the st_* fields I'll need (stat_errno and stat_error need to change or go away).

Code:

@sqlite> select sql from sqlite_master;
CREATE TABLE db (
        id INTEGER NOT NULL PRIMARY KEY,
        st_mode INTEGER,    st_ino INTEGER,     st_dev INTEGER,
        st_nlink INTEGER,   st_uid INTEGER,     st_gid INTEGER,
        st_size INTEGER,    st_atime INTEGER,   st_mtime INTEGER,
        st_ctime INTEGER,   stat_errno INTEGER, stat_error TEXT,
        filetype TEXT,      filename TEXT,      path TEXT,
        full_pathname TEXT
    )
Last edited by pjp on Sun May 14, 2023 3:25 pm, edited 1 time in total.
Quis separabit? Quo animo?
hdcg
Tux's lil' helper
Posts: 122
Joined: Sun Apr 07, 2013 8:30 am

Post by hdcg » Thu May 11, 2023 5:01 am

My 2 cents regarding some of your points:
pjp wrote:My thought was to create a separate index to track the potential duplicates.
You already have this relationship. The foreign key is actually the file size. An index on the file size may be everything you need to perform the following pseudo-code algorithm:

Code:

for each file size from (select distinct st_size from ... order by st_size)
  for each file of current file size (select filename from ... where st_size = cur_size)
    do duplicate check
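For reference, the pseudo code above maps almost directly onto Python's built-in sqlite3 module. A minimal sketch, assuming the db table and the st_size / filename columns from the schema in the first post:

```python
import sqlite3

def find_size_groups(con):
    """Yield (size, [filenames]) for each distinct st_size, per the
    pseudo code above. Table and column names follow the schema from
    the first post."""
    sizes = con.execute(
        "SELECT DISTINCT st_size FROM db ORDER BY st_size").fetchall()
    for (size,) in sizes:
        files = [row[0] for row in con.execute(
            "SELECT filename FROM db WHERE st_size = ? ORDER BY filename",
            (size,))]
        yield size, files  # the duplicate check runs on each group
```

Groups with a single filename are the unique sizes; the duplicate check only needs to look at groups of two or more.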
pjp wrote:My intent was to flag potentially duplicate files by identifying files of the same size. For those files, I would then perform full hash or quick / partial hash, depending on the file size.
Looking at your numbers, most of the files have non-unique file sizes anyway. Therefore the question is whether it makes sense to hash all files directly and thus create a much better key for potential duplicate searches.
Goverp
Advocate
Posts: 2401
Joined: Wed Mar 07, 2007 6:41 pm

Post by Goverp » Thu May 11, 2023 8:24 am

pjp, you've probably already considered them, or have too much invested in your current arrangement, but there are packages that do exactly what you are looking at (such as app-misc/rdfind), unless of course there are other considerations not directly relevant to your question, or maybe the volumes concerned.
Greybeard
szatox
Advocate
Posts: 3858
Joined: Tue Aug 27, 2013 12:35 pm

Post by szatox » Thu May 11, 2023 9:24 am

I've been thinking about something similar actually, though using a fast hash function from the get-go. Perhaps CRC would flatten the distribution?
Oh, it might be a good idea to ignore empty files, since they are typically made as flags rather than content.
Avoid secure hashes for this, they are intentionally slow.
pjp wrote:I expect I'll need to do something with the full path to a file to improve search granularity, but I have no idea how to do that yet
Does SQLite allow indexing text-type fields and select ... where `column` like 'pattern%' ? Paths go from coarse granularity to fine, so it should cover arbitrary granularity.


Edit: There's dev-libs/xxhash in portage.
Description: Extremely fast non-cryptographic hash algorithm
Sounds like a good candidate.
wjb
l33t
Posts: 681
Joined: Sun Jul 10, 2005 9:40 am
Location: Fife, Scotland

Post by wjb » Thu May 11, 2023 4:16 pm

I think I'd use a temporary table to record the results of a file scan, then compare that with the main table to decide whether to get a hash or not. But I'd always aim for having a hash in the main table, which is expensive the first time but not really after that. Also use mtime as well as size to do the quick check for changes because in-place changes are a thing. Try and do all the heavy lifting with SQL, and with batch updates/inserts after processing rather than loads of individual queries in iteration code. And use every core you have for getting the hash values.
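The last point (use every core for hashing) can be sketched with a thread pool. This is a hedged sketch rather than a tuned implementation: hashlib releases the GIL while digesting large buffers, so even threads keep several cores busy on big files; the function names are made up here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_file(path, chunk=1 << 20):
    """Return (path, SHA-1 hex digest), reading the file in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return path, h.hexdigest()

def hash_files(paths, workers=8):
    """Hash many files concurrently and return {path: digest}.

    hashlib releases the GIL while digesting large buffers, so a thread
    pool parallelizes well on big files; a ProcessPoolExecutor may win
    when the files are mostly small.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_file, paths))
```

The resulting {path: digest} dict can then be written back to the main table in one batch, as suggested above.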

Kind of curious what the 2/7/20 byte files are given the numbers of them??
Post by pjp » Thu May 11, 2023 9:39 pm

hdcg wrote:
My thought was to create a separate index to track the potential duplicates.
You already have this relationship. The foreign key is actually the file size. An index on the file size may be everything you need to perform the following pseudo code algorithm:

Code:

for each file size from (select distinct st_size from ... order by st_size)
  for each file of current file size (select filename from ... where st_size = cur_size)
    do duplicate check
If I'm understanding this correctly, the first loop iterates over a list of unique file sizes returned by the query, and the second loop iterates over every file that is of a given size, correct? If so, that would include the unique files too.

What I did was a little different:

Code:

    # SQL query returns (id, st_size) for all rows.

    # Add each record to a dict keyed by file size (key);
    # the value is the list of record IDs with that size.
    for (value, key) in rows:
        if key in duplst:
            duplst[key].append(value)
        else:
            duplst[key] = [value]

    # Consider only file sizes where more than one file is of that size.
    for sz in duplst:
        lenlst = len(duplst[sz])
        if lenlst > 1:
            pass  # these records are potential duplicates
I presumed my method wasn't very efficient, but decided to work on performance later. It's not _that_ slow, considering the number of files. I'll try your method to see how it works.

What I had in mind was to then populate an index so that only duplicates were listed in it, so that I could create hashes based on the index, automatically excluding unique files. Given my particular data set, that may not be as useful as I first thought. Nevertheless, it _seems_ like a Good Idea. Also, I had been conflating table and index as mostly equivalent, so I have to go back and review how indexes work to understand how one could exclude unique file sizes.

hdcg wrote:Looking at your numbers, most of the files have non-unique file sizes anyway. Therefore the question is whether it makes sense to hash all files directly and thus create a much better key for potential duplicate searches.
There is that. When I first looked into how to determine uniqueness, file size seemed to be a key initial filter. If files are of different size, then by definition they cannot be exactly the same.

The simple solution would be to hash everything and see if the time that takes is worth a different approach. My presumption was that it would be worth it. I may reconsider. I'll have to do some more thinking, but I'm starting to lean that way just to move on from this problem.

Thanks!
Post by pjp » Thu May 11, 2023 9:51 pm

Goverp wrote:pjp, you've probably already considered them, or have too much invested in your current arrangement, but there are packages to do exactly what you are looking at (such as app-misc/rdfind), unless of course there are other considerations not directly relevant to your question. Or maybe the volumes concerned.
I believe I have looked at that one and some similar, but I do appreciate the suggestion.

My first attempt a long time ago was with fdupes. My issue with these tools is that the output requires too much manual manipulation to be valuable for how I'd like to use such a tool. With fdupes, I managed to delete some files I would rather have not deleted. They were that important, and I no longer recall anything about them.

After that experience I decided that what I wanted was an interactive tool that looked something like a file manager. If I can get there, one feature will be able to repackage the desired files into a new archive. So given say 10 directories named foo1 ... foo10: show me all unique files, preferring the oldest. Allow me to filter files I don't want. Then, put them in a tar file as 'new_foo/blah'.

At least that's the idea for now. Who knows what happens. :)
Post by pjp » Thu May 11, 2023 10:13 pm

szatox wrote:I've been thinking about something similar actually, though using a fast hash function from the get go. Perhaps CRC would flatten the distribution?
Oh, it might be a good idea to ignore empty files, since they are typically made as flags rather than content.
Avoid secure hashes for this, they are intentionally slow.
My initial research suggested that it might be faster to eliminate files that are easily determined not to be duplicates because they don't share the same file size, so that's how I came to my current approach. I did consider hashing everything. Quickly thinking about that possibility now, I do wonder if the tool might be much simpler. I may have to do some testing. I briefly looked at different hashes and considered crc32. I don't recall why I rejected that one, but rather than get bogged down in picking the best hash, I just went with sha1 to get something working. For small files (where small is undefined), I would hash the entire file. For whatever is large, I'd hash the beginning and end of the file. I'm currently using an arbitrary value of 8_388_608 bytes for the beginning and end. I have no idea what an optimal size should be.
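A sketch of that partial-hash idea: hash the whole file when it fits within the chunk size, otherwise hash only the head and tail. The 8_388_608 byte chunk is the arbitrary value from the post, and quick_hash is a made-up name:

```python
import hashlib
import os

CHUNK = 8_388_608  # the arbitrary 8 MiB head/tail size from the post

def quick_hash(path, chunk=CHUNK):
    """SHA-1 of the whole file if it is at most `chunk` bytes long,
    otherwise SHA-1 of only the first and last `chunk` bytes.

    Files whose quick hashes match still need a full hash (or a byte
    compare) before being treated as real duplicates.
    """
    size = os.path.getsize(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= chunk:
            h.update(f.read())          # small file: hash everything
        else:
            h.update(f.read(chunk))     # head
            f.seek(size - chunk)
            h.update(f.read(chunk))     # tail (may overlap the head)
    return h.hexdigest()
```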
szatox wrote:Does SQLite allow indexing text-type fields and select ... where `column` like 'pattern%' ? Paths go from coarse granularity to fine, so it should cover arbitrary granularity.
I can do pattern matching. I learned that of my 54k size 0 files, only 242 are ".keep" files. It seems capable of most SQLy things, so I would be surprised if it can't index text fields. I wasn't sure how well that kind of filter would work in practice, so I'll have to test that. My main thought there had more to do with the UI. If pages of paths are presented to the user, how quickly can a filter in a text box work with sql data? I'm guessing it would have to be cached, then verified. I have no clue how any of that is done.

One way I can imagine wanting to interact with the UI is to see the .keep files, then think, OK, show me all of those files. Then determine that I don't want all of them. And from the new results excluding those files, I look for the next filter.
szatox wrote:Edit: There's dev-libs/xxhash in portage.
Description: Extremely fast non-cryptographic hash algorithm
Sounds like a good candidate.
Thanks, I've made a todo item to review that at some point. Portage even has Python bindings (dev-python/xxhash).

Even if everything is hashed from the start, there would still need to be a means of determining which files were duplicates, so I suppose I'll need to understand how to create that index, regardless of the method used to identify duplicates.

Thanks for the feedback!
Hu
Administrator
Posts: 24380
Joined: Tue Mar 06, 2007 5:38 am

Post by Hu » Thu May 11, 2023 10:19 pm

pjp wrote:What I did was a little different:

Code:

    for (value,key) in rows:
        if key in duplst:
            duplst[key].append(value)
        else:
            duplst[key] = [value]
This can be simplified by using collections.defaultdict.

Code:

import collections

# later...
duplst = collections.defaultdict(list)
for value, key in rows:
	duplst[key].append(value)
If there is no element identified by key, then list is called (returning an empty list) and that returned value is added to the dictionary under the requested key. Then you can always append to the list, whether or not it existed before the statement started.
Post by pjp » Thu May 11, 2023 11:43 pm

wjb wrote:I think I'd use a temporary table to record the results of a file scan, then compare that with the main table to decide whether to get a hash or not.
I mentioned in another post that I was confusing an index and a table, so perhaps what I was thinking of was a temporary table rather than an index. Loosely, though, that's the general idea. The initial scan of a directory would populate the contents. Then from there, I'd do something that resulted in a temporary table, an index, or something else. Using the temp table / index / something else, I would then selectively generate hashes and add those to the main table. My presumption was that I'd need relatively few hashes and I'd be able to perform a simple search on any file in the main table that had a hash. Obviously that presumption was wrong in a big way.
wjb wrote:But I'd always aim for having a hash in the main table, which is expensive the first time but not really after that.
As my intended use is to interact with the data similar to how a file manager might be used, I was thinking that delaying when the hash was performed would improve interacting with the UI. That approach also allowed me to delay when I had to figure out how to hash a file, which turned out to not be too difficult in the general case.
wjb wrote:Also use mtime as well as size to do the quick check for changes because in-place changes are a thing.
Good point. I hadn't really considered in-place changes as the data I'm most interested in is dormant. Determining when to do those checks is an interesting thought exercise. Before deleting anything would be obvious. Perhaps while creating an archive. At some point, if data is that active, there's always a chance of a problem. Maybe pyinotify will be useful. I've added a todo for testing pyinotify.
wjb wrote:Try and do all the heavy lifting with SQL, and with batch updates/inserts after processing rather than loads of individual queries in iteration code. And use every core you have for getting the hash values.
Good points. Right now I'm probably doing one insert at a time, though I don't commit until they're all done. The only select I'm doing is for everything, to then use python to identify which files are duplicates. Since I'm not exactly sure what I'll have, initial speed doesn't seem too bad. The second db includes the directory structure that is in the first db:

Code:

creating...

real    0m42.677s
user    0m38.632s
sys     0m3.281s


Records: 66251
Size   : 165M


creating...

real    1m47.962s
user    1m38.280s
sys     0m8.353s

Records: 1703795
Size   : 411M
I'm guessing when I start working on the UI is when I'll have to pay more attention to how I'm performing queries.
wjb wrote:Kind of curious what the 2/7/20 byte files are given the numbers of them??
It seems to vary quite a bit, but partly points to why I'm wanting the tool in the first place.

db/pkg/kde-base/kwalletd-4.7.3/SIZE

There are 2488 7 byte files with the name SIZE.

If all files of name SIZE are essentially the same type of file:

Filename SIZE byte sizes: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Total files named SIZE: 7296

I see some AndroidStudio json files, other files under db/pkg.

In an OS related backups folder, there are 66159 files, 35786 of which are not part of 'db/pkg'.

There are areas I could delete right now without much thought. But there are other areas where I know I want to be more careful. The 411MB database references data that I've collected since at least the mid 2000s. I hope it will be easier to find things once I thin out the chaff.

Thanks for the suggestions!
Post by pjp » Thu May 11, 2023 11:48 pm

Hu wrote:This can be simplified by using collections.defaultdict.

Code:

import collections
Thank you! I've added it to my todo list. Even though it seems obvious, I generally try to understand something fairly well before using it instead of just inserting a black box.
Post by pjp » Fri May 12, 2023 5:45 am

Well, I may have an answer:

Code:

@sqlite> CREATE TEMP TABLE stszuniq AS SELECT st_size FROM db WHERE st_size > 0 GROUP BY st_size HAVING COUNT(*) > 1;
@sqlite> select count(*) from stszuniq;
69843
It took me longer than I'd like to admit to realize why the count is two less than my original number of file sizes that were shared with at least one other file. Due to permission issues, there are 57 records where st_size is NULL. I hadn't yet caught that oversight, despite having seen an empty line in the list of sizes that were supposedly duplicates.

I think that should allow me to eliminate managing the list within python, and simply process a query of that table to create a new hash entry in the primary table. As a bonus, I think it might be faster too.
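Wrapped in Python's sqlite3, the temp table can drive the hashing step directly. A sketch assuming the schema from the first post; potential_duplicates is a hypothetical helper name:

```python
import sqlite3

def potential_duplicates(con):
    """Return (id, full_pathname) for every row whose st_size appears
    in the temp table of shared sizes, i.e. the rows worth hashing."""
    con.execute("""
        CREATE TEMP TABLE stszuniq AS
        SELECT st_size FROM db
        WHERE st_size > 0
        GROUP BY st_size HAVING COUNT(*) > 1
    """)
    return con.execute("""
        SELECT id, full_pathname FROM db
        WHERE st_size IN (SELECT st_size FROM stszuniq)
        ORDER BY st_size, id
    """).fetchall()
```

Iterating over the result and updating a hash column per id replaces the Python-side dict bookkeeping entirely.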

I'll hopefully get that tested tomorrow and will provide results as soon as I have them.

Thanks again! If I had come to that solution without the help, I don't think it would have been any time soon.
Post by szatox » Fri May 12, 2023 8:10 am

pjp wrote:Even if everything is hashed from the start, there would still need to be a means of determining which files were duplicates, so I suppose I'll need to understand how to create that index, regardless of the method used to identify duplicates.
Ok, I think I know where the confusion comes from. The thing YOU consider an index and the thing a DATABASE considers an index are two different things.
Your "index" is the database of files, because this is what you're going to search for interesting bits of data. For the database, an index is a structure of pointers to your records, ordered by one or more columns.
e.g. your PRIMARY KEY is an index, and so is any other KEY added to a table. PRIMARY, FOREIGN and UNIQUE, often seen in CREATE TABLE commands, are constraints created on top of a "KEY", more often referred to as an INDEX. Constraints inform the database how it should handle some relations within the data, and the key allows enforcing those constraints in a timely manner.

Since an index is ordered, queries using it can run way faster than those doing a full scan of the table.
Querying files for path LIKE 'pattern%' (notice the % at the end but NOT at the beginning) should trigger a range-type access on an indexed column. LIKE '%pattern%' would still do a full table scan since the pattern is not anchored; some db engines provide special types of index for text using proximity search, but it's tricky and out of scope here.
Anyway, if you use full paths with an index, the search _should_ be very fast. Search for basename on the same column definitely won't be.
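This can be checked in SQLite itself with EXPLAIN QUERY PLAN. One caveat: SQLite only rewrites LIKE 'prefix%' into an index range scan when the comparison is case-sensitive (PRAGMA case_sensitive_like = ON) or the indexed column uses NOCASE collation. A sketch with a hypothetical idx_path index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE db (id INTEGER PRIMARY KEY, full_pathname TEXT)")
con.execute("CREATE INDEX idx_path ON db (full_pathname)")
# The LIKE prefix optimization only fires for case-sensitive comparisons
con.execute("PRAGMA case_sensitive_like = ON")
con.executemany("INSERT INTO db (full_pathname) VALUES (?)",
                [("/usr/bin/ls",), ("/usr/bin/cat",), ("/etc/fstab",)])

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM db WHERE full_pathname LIKE '/usr/%'"
).fetchall()
# The plan detail should report a SEARCH ... USING INDEX idx_path
# (a range scan over the prefix), rather than a full SCAN of db.
```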

I gave xxhash a shot. It IS fast. The 64-bit version on my 64-bit CPU slurps a 2GB file in 0.7 sec. Clearly beats crc. I saw some information that it's not yet mature and the resulting hashes may change in future versions, so it shouldn't be used for long-term storage; not sure if this information is up to date.
Post by wjb » Fri May 12, 2023 10:49 am

I still think there are only a few jobs the python has to do:
  • Collect the list of all files/attributes (not hash)
  • Given a list of files, generate a list of their hashes
  • Output the result of the final report query
The SQL side can/should do the rest. Remember that INSERT/UPDATE/DELETE can all use a SELECT to get a list of values (tuples) to operate on, see the SQLite tramline diagrams, "expr" can be "(select .......)". This is back to front compared to doing it in python, where you'd do the select first and loop over the result doing inserts/updates/deletes as you go, but the point is that the database can do it more efficiently (in multiple threads where possible).
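As an illustration of that set-based style, a single UPDATE with a nested SELECT can flag every potential duplicate at once instead of looping in Python. The is_dup column is hypothetical, added only for this example:

```python
import sqlite3

def flag_duplicates(con):
    """Set is_dup = 1 on every row whose st_size is shared by two or
    more rows, in one set-based statement instead of a Python loop."""
    con.execute("""
        UPDATE db SET is_dup = 1
        WHERE st_size IN (
            SELECT st_size FROM db GROUP BY st_size HAVING COUNT(*) > 1
        )
    """)
```

The database evaluates the subquery once and applies the update internally, which is exactly the "back to front" pattern described above.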

I've actually just been through this to add an sqlite-powered index to my collection of scanned letters/bills, previously organised only by a file naming convention. This has also involved sorting out differences in on-disk and in-database versions of reality.

SQLiteStudio can be very convenient for trying out complicated SQL compared to using the sqlite command line.
Post by pjp » Sat May 13, 2023 9:44 pm

szatox wrote:
Even if everything is hashed from the start, there would still need to be a means of determining which files were duplicates, so I suppose I'll need to understand how to create that index, regardless of the method used to identify duplicates.
Ok, I think I know where the confusion comes from. The thing YOU consider an index and the thing a DATABASE considers an index are two different things.
Your "index" is the database of files, because this is what you're going to search for interesting bits of data. For the database, an index is a structure of pointers to your records, ordered by one or more columns.
e.g. your PRIMARY KEY is an index, and so is any other KEY added to a table. PRIMARY, FOREIGN and UNIQUE, often seen in CREATE TABLE commands, are constraints created on top of a "KEY", more often referred to as an INDEX. Constraints inform the database how it should handle some relations within the data, and the key allows enforcing those constraints in a timely manner.
Primarily, the word 'table' doesn't seem like the correct word for what it is. And because I don't use indexes, my mind jumps to the word index as what seems like it should be the name for what is called a table. That said, I also didn't realize that (apparently?) an actual index can't contain a subset of the data it references? So, an index that would contain only st_size where count > 1. I knew I wasn't thinking of index correctly when I posted the question, but I thought I could "get away with it" by explaining what I was trying to achieve. Also, I don't think the description I read about indexes clarified my confusion nearly as much as your comment about them being pointers, so thanks for that too!
szatox wrote:Since an index is ordered, queries using it can run way faster than those doing a full scan of the table.
Querying files for path LIKE 'pattern%' (notice the % at the end but NOT at the beginning) should trigger a range-type access on an indexed column. LIKE '%pattern%' would still do a full table scan since the pattern is not anchored; some db engines provide special types of index for text using proximity search, but it's tricky and out of scope here.
Anyway, if you use full paths with an index, the search _should_ be very fast. Search for basename on the same column definitely won't be.
I'll have to make a note of that for when I get to doing path selection as part of filtering.
szatox wrote:I gave xxhash a shot. It IS fast. 64 bit version on my 64bit cpu slurps 2GB file in 0.7 sec. Clearly beats crc. I saw some information that it's not yet mature and the resulting hashes may change in future versions, so it shouldn't be used for long-term storage, not sure if this information is up to date.
As long as it's reasonable with collisions, I don't need it for long term use, so that helps.
Post by szatox » Sat May 13, 2023 9:57 pm

So... I went and inspected my system.
Unsurprisingly, .keep files take the cake.
Followed by a shitload of CHOST and CXXFLAG files in /var/db/pkg. This answers the question of what those 7 and 20 byte files are, and possibly the other legions of identical non-empty files.

And after filtering out empty files and /var/db/pkg I got this little gem:

Code:

+-------+----------------------+----------------------------------------------------+
| count | hash                 | path                                               |
+-------+----------------------+----------------------------------------------------+
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-credential               |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-get-tar-commit-id        |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-restore                  |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-maintenance              |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-mktree                   |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-format-patch             |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-switch                   |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-cherry-pick              |
|   141 | 13475184963877333011 | /usr/libexec/git-core/git-stash                    |
~~~/ /~~~
+-------+----------------------+----------------------------------------------------+
That's 141 * 3.6MB = ~500MB of a single git binary.
.
.
.
.

Or is it?

Code:

# du -hs /usr/libexec/git-core/
23M	/usr/libexec/git-core/
# ls -nlh /usr/libexec/git-core/git-credential
-rwxr-xr-x 141 0 0 3,6M 05-05 20:34 /usr/libexec/git-core/git-credential
Gotta say, they had me in the first part.

BTW, bonus tip: if you're populating your database with a million records for the first time, do that before creating indices.
Creating an index from scratch afterwards is way faster than updating it after every single insert.
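That tip translates to something like the following in sqlite3: batch everything in one transaction, then build the index once at the end. load_then_index and idx_size are hypothetical names for the sketch:

```python
import sqlite3

def load_then_index(con, rows):
    """Bulk-load (st_size, full_pathname) tuples, then build the index.

    One sorted bulk build at the end is far cheaper than updating the
    index on every single insert during the initial load.
    """
    con.execute(
        "CREATE TABLE IF NOT EXISTS db "
        "(id INTEGER PRIMARY KEY, st_size INTEGER, full_pathname TEXT)")
    with con:  # one transaction for the whole batch
        con.executemany(
            "INSERT INTO db (st_size, full_pathname) VALUES (?, ?)", rows)
    con.execute("CREATE INDEX IF NOT EXISTS idx_size ON db (st_size)")
```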



Edit:
pjp wrote:That said, I also didn't realize that (apparently?) an actual index can't contain a subset of the data it references?
It's implementation dependent; e.g. in the case of InnoDB the primary key contains all the data from that table, so it _can_, but you don't really need to know this until you get to the point where you need to squeeze every last drop of performance out of your database. Pointers are the much more common case.
What you do need to know is that a good index (one with flat distribution) makes reads referring it much faster, but every index on a table makes writes slower.


Edit2:
pjp wrote:So, an index that would contain only st_size where count > 1.
Wasn't counting identical files a query?
You don't get an index on a query, you use index to optimize a query.
Look at this:


Code:

MariaDB [myfiles]> select count(*) as count, xxh as hash from file group by xxh order by count desc limit 10;
+-------+----------------------+
| count | hash                 |
+-------+----------------------+
| 13883 | 17241709254077376921 |
|  2421 | 15455416940746275070 |
|  2355 |  7798227106775105808 |
|  1206 |   212914472182500658 |
|  1195 | 11647564342387378732 |
|  1000 |  3961001529558505388 |
|   869 |  9332523865658125927 |
|   817 |  5477814735019410036 |
|   797 |   950620316618548815 |
|   705 | 14542161136692776343 |
+-------+----------------------+
10 rows in set (0,263 sec)

MariaDB [myfiles]> explain select count(*) as count, xxh as hash from file group by xxh order by count desc limit 10;
+------+-------------+-------+-------+---------------+------+---------+------+--------+----------------------------------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                                        |
+------+-------------+-------+-------+---------------+------+---------+------+--------+----------------------------------------------+
|    1 | SIMPLE      | file  | index | NULL          | xxh  | 8       | NULL | 822015 | Using index; Using temporary; Using filesort |
+------+-------------+-------+-------+---------------+------+---------+------+--------+----------------------------------------------+
1 row in set (0,000 sec)
I asked to count files with the same hash and display results ordered by count.
This query uses index on column xxh to count it. Easy task, since all identical files are always in consecutive records, so we stop counting as soon as we hit a new hash value.

Without the index, the same query needs 2.7 s (up from 0.26 s) to run, with a new execution plan:
MariaDB [myfiles]> explain select count(*) as count, xxh as hash from file ignore index(xxh) group by xxh order by count desc limit 10;

Code:

+------+-------------+-------+------+---------------+------+---------+------+--------+---------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                           |
+------+-------------+-------+------+---------------+------+---------+------+--------+---------------------------------+
|    1 | SIMPLE      | file  | ALL  | NULL          | NULL | NULL    | NULL | 822015 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+--------+---------------------------------+
It's an order of magnitude speedup from just limiting randomness. Both cases walk over _all_ records in the table, there is no difference in the amount of data accessed.
Last edited by szatox on Sat May 13, 2023 10:32 pm, edited 2 times in total.
Post by pjp » Sat May 13, 2023 10:00 pm

wjb wrote:I still think there are only a few jobs the python has to do:
  • Collect the list of all files/attributes (not hash)
  • Given a list of files, generate a list of their hashes
  • Output the result of the final report query
1 is done, at least as much as it can be at this point. Since it works, I consider the rest to be refinement as needs dictate what changes would be necessary.

2. is next on my list. I didn't get around to doing anything yesterday, but maybe tonight.
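For step 2, a minimal sketch of the hashing helper. The names, the 64 MiB cutoff, and the choice of sha256 are all assumptions here (hashlib stands in for whatever hash the real code uses); the partial variant reads only the first block of large files, matching the full-vs-quick-hash split described earlier in the thread.

```python
import hashlib

PARTIAL_THRESHOLD = 64 * 1024 * 1024  # assumption: "large" means > 64 MiB
BLOCK = 1 * 1024 * 1024

def hash_file(path, partial=False):
    """Return a hex digest; with partial=True, hash only the first BLOCK bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if partial:
            h.update(f.read(BLOCK))
        else:
            for chunk in iter(lambda: f.read(BLOCK), b""):
                h.update(chunk)
    return h.hexdigest()

def hash_files(paths, sizes):
    """Given matching lists of paths and st_size values, yield (path, digest)."""
    for path, size in zip(paths, sizes):
        yield path, hash_file(path, partial=size > PARTIAL_THRESHOLD)
```

The digests then go back into the primary table keyed by record id, as described in the last post of the thread.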

3. will depend on what I use for a UI. I'm leaning toward starting with a terminal version. For testing, I built a basic forward/backward "web-style" pager for the command line using unrelated sample data. It works decently enough as a starting point, so I just need to put a wrapper around it and work out how to handle user input for filtering.
wjb wrote:The SQL side can/should do the rest. Remember that INSERT/UPDATE/DELETE can all use a SELECT to get a list of values (tuples) to operate on, see the SQLite railroad diagrams, "expr" can be "(select .......)". This is back to front compared to doing it in python, where you'd do the select first and loop over the result doing inserts/updates/deletes as you go, but the point is that the database can do it more efficiently (in multiple threads where possible).
I'll eventually need to handle updates, and for that, I don't yet have an idea how that would work. Before I get into that, I'll need to figure out how to get results only related to some subset of a larger directory hierarchy.
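wjb's point about a DELETE consuming a SELECT can be sketched like this; the table and column names are assumptions for illustration. One statement keeps the lowest-id file in each group of identical hashes and removes the rest:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, st_size INTEGER, xxh TEXT)")
con.executemany("INSERT INTO file (st_size, xxh) VALUES (?, ?)",
                [(10, "a"), (10, "a"), (10, "b"), (20, "c")])

# The DELETE consumes the SELECT's id list directly: no Python-side loop.
con.execute("""
    DELETE FROM file
    WHERE id NOT IN (SELECT MIN(id) FROM file GROUP BY xxh)
""")
print(con.execute("SELECT id, st_size, xxh FROM file ORDER BY id").fetchall())
# → [(1, 10, 'a'), (3, 10, 'b'), (4, 20, 'c')]
```

The same shape works for UPDATE (e.g. marking duplicates) by swapping the outer statement.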
wjb wrote:I've actually just been through this to add an sqlite-powered index to my collection of scanned letters/bills, previously organised only by a file naming convention. This has also involved sorting out differences in on-disk and in-database versions of reality.
Yeah, the difference between db and disk has been a concern. But the data I'm most interested in addressing first is dormant, so the disk/db issue isn't an urgent concern.
wjb wrote:SQLiteStudio can be very convenient for trying out complicated SQL compared to using the sqlite command line.
I'll take a look. I've not had good luck when I've had to install qt libraries. But it's been a while since I've tried, so I'm due for a reminder as to why I keep uninstalling anything related to qt :)
Quis separabit? Quo animo?
Post by pjp » Sat May 13, 2023 10:14 pm

szatox wrote:This answers question what those 7 and 20 byte files are, and possibly the other legions of identical non-empty files.
I didn't check quantities, but the first files I noticed as duplicates were icons of some sort, which made some sense of why the number of duplicates seemed so high. I had guessed that a lot of files might be close to, but not exactly, the same size in bytes while taking the same amount of disk space. I know a lot of the data is duplicated; I just didn't expect this much. When I get rid of most of the OS-related backups, I'll be curious to see how the numbers change.
szatox wrote:Gotta say, they had me in the first part.
Sparse files?
szatox wrote:BTW, bonus tip: if you're populating your database with a million records for the first time, do that before creating indices.
Creating index from scratch afterwards is way faster than updating it after every single insert.
Fortunately the use is mostly short term, so I don't know how much I'll need to index. I wonder how many changes would warrant deleting the index and re-creating it after the changes have completed.
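szatox's bulk-load tip can be sketched in SQLite with an assumed table layout. On a small in-memory sample the gap is modest, but the shape of the test carries over to a disk-backed million-row load, where insert-then-index wins clearly:

```python
import sqlite3, time

rows = [(i % 1000,) for i in range(200_000)]  # stand-in for a million-record load

def load(index_first):
    """Time the INSERTs plus the index build, in either order."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, st_size INTEGER)")
    t0 = time.perf_counter()
    if index_first:
        con.execute("CREATE INDEX idx_size ON file (st_size)")
    with con:  # one transaction for the whole batch
        con.executemany("INSERT INTO file (st_size) VALUES (?)", rows)
    if not index_first:
        con.execute("CREATE INDEX idx_size ON file (st_size)")
    return time.perf_counter() - t0, con

t_first, _ = load(index_first=True)
t_after, con = load(index_first=False)
print(f"index first: {t_first:.3f}s  index after: {t_after:.3f}s")
```

Either order ends with the same table and index; only the build cost differs, because a from-scratch CREATE INDEX sorts once instead of rebalancing the b-tree on every insert.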
Post by szatox » Sat May 13, 2023 10:46 pm

Sparse files?
Hard links. 141 names point to the same git executable. Everyone else would have used softlinks here, but I guess git is special :lol:
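Hard links are cheap to screen out before hashing: names sharing (st_dev, st_ino) are the same file on disk, so os.stat can group them without reading any data. A sketch (the helper names are made up, not from the thread's code):

```python
import os
from collections import defaultdict

def group_hard_links(paths):
    """Group pathnames that point at the same inode (same file on disk)."""
    groups = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        groups[(st.st_dev, st.st_ino)].append(p)
    return groups

def one_per_inode(paths):
    """Keep one name per inode; the dropped names are guaranteed identical."""
    return [names[0] for names in group_hard_links(paths).values()]
```

Run before the hash step, this collapses cases like the 141 git names into a single candidate.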
I wonder how many changes would warrant deleting the index and re-creating it after the changes have completed.
Importing a database, recovering from backup, stuff like that. If your update takes less than a few minutes, it is not worth the time you'll spend typing the commands. However, sometimes you can turn days of downtime into hours of downtime, and this does make a difference.

Added some extra info about indexing to my previous post.
Post by pjp » Sun May 14, 2023 6:00 am

szatox wrote:
Sparse files?
Hard links. 141 names point to the same git executable. Everyone else would have used softlinks here, but I guess git is special :lol:
I wasn't sure what 141 referenced. That does seem like a lot of links.
szatox wrote:Importing a database, recovering from backup, stuff like that. If your update takes less than a few minutes, it is not worth the time you'll spend typing the commands. However, sometimes you can turn days of downtime into hours of downtime, and this does make a difference.

Added some extra info about indexing to my previous post.
For UI responsiveness, I wondered if dropping an index and recreating it might be faster. Sounds like that's not likely to be necessary.
szatox wrote:Wasn't counting identical files a query?
You don't get an index on a query, you use index to optimize a query.
It was a query. What I was thinking, though, was that an index could be created containing only the relevant records, which in theory should produce faster results.

Although now that you mention it, creating the temporary table is unnecessary, because all I'm doing is querying it for results the original query already gives me. Nevertheless, if indexes worked the way I'm suggesting, with the 1.7M-entry table reduced to an index of "duplicate files", a query against that index would produce only the results I wanted. That was the idea, anyway.
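For reference, SQLite's partial indexes (CREATE INDEX ... WHERE expr) only let the WHERE test each row on its own, so "sizes that occur more than once" can't be expressed as an index predicate directly. The closest match to the idea is a small side table filled by one INSERT ... SELECT; a sketch with assumed table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, st_size INTEGER, fq_pathname TEXT)")
con.executemany("INSERT INTO file (st_size, fq_pathname) VALUES (?, ?)",
                [(10, "/a"), (10, "/b"), (20, "/c"), (30, "/d"), (30, "/e"), (30, "/f")])

# One statement builds the "relevant records" table: every size shared by 2+ files.
con.execute("CREATE TABLE dup_size (st_size INTEGER PRIMARY KEY)")
con.execute("""
    INSERT INTO dup_size
    SELECT st_size FROM file GROUP BY st_size HAVING COUNT(*) > 1
""")

# Joining back yields exactly the hash candidates, nothing else.
cands = con.execute("""
    SELECT f.id, f.st_size, f.fq_pathname
    FROM file f JOIN dup_size d USING (st_size)
    ORDER BY f.st_size, f.id
""").fetchall()
print(cands)  # → [(1, 10, '/a'), (2, 10, '/b'), (4, 30, '/d'), (5, 30, '/e'), (6, 30, '/f')]
```

With dup_size being only ~70k single-integer rows against 1.7M files, queries against the join stay small.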

I finally tested the query / temporary-table creation, and it works. On a small data sample it's really quick, but that's only 229 records in total.

I need to clean up the test code and try it on real data. Eliminating creation of the temporary table will help some, but I will still have what I think you (or someone else) suggested: querying the sizes, then iterating over them. That should be doable with a nested query, if that's the right term: a SELECT for the duplicate sizes, then a SELECT on those results to get the file data needed to perform the hash.
Post by pjp » Sun May 14, 2023 3:37 pm

Despite several attempts to make this more difficult than it is, I finally figured out what should have been apparent much earlier:

I had been creating a temporary table of st_size values, which wasn't necessary. I was then querying that table for the list of sizes, followed by querying the primary table for all records of each size. I think it was three queries to do what the one below does in one. Oh well.

Code: Select all

SELECT id,st_size,fq_pathname FROM {tn} GROUP BY st_size HAVING COUNT(*) > 1;
The st_size isn't needed in the list of selected fields, but I included it to eyeball the results; they appear correct without it. That then allows me to reference the record id when inserting a hash into the primary table.
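One thing worth double-checking on real data: GROUP BY st_size returns a single row per size, with id and fq_pathname taken from an arbitrary member of each group, so this names one candidate per duplicated size rather than every file sharing that size. Moving the HAVING into a subquery returns them all; a sketch with the same column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, st_size INTEGER, fq_pathname TEXT)")
con.executemany("INSERT INTO file (st_size, fq_pathname) VALUES (?, ?)",
                [(10, "/a"), (10, "/b"), (20, "/c")])

# GROUP BY collapses each duplicated size to one row: 1 row here.
grouped = con.execute(
    "SELECT id, st_size, fq_pathname FROM file "
    "GROUP BY st_size HAVING COUNT(*) > 1").fetchall()

# The subquery form returns every member of every duplicated size: both /a and /b.
expanded = con.execute("""
    SELECT id, st_size, fq_pathname FROM file
    WHERE st_size IN (SELECT st_size FROM file GROUP BY st_size HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()
print(len(grouped), len(expanded))  # → 1 2
```

If every file of a duplicated size needs a hash inserted, the expanded form is the one to iterate over.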

Thanks again!