Gentoo Forums :: Portage & Programming

optimising find handling in bash script

DaggyStyle (Watchman; joined 22 Mar 2006; 5909 posts)
Posted: Thu Aug 09, 2018 2:37 pm    Post subject: optimising find handling in bash script

Greetings,

I have a script that walks a file tree, selects specific files and builds a list of checksums from them; it then walks a few other file trees and adds any files whose sums are not in that list to the tree it scanned first.
For the first pass I use this loop:
Code:
for file in $(find ${sub_target_folder} | egrep  -i "${extensions}"); do ...

and for the second loop, I use this find:
Code:
for file in $(find path1 path2 path3 -printf "%T@ %p\n" 2>/dev/null | grep -v ${sub_target_folder} | egrep -i "${extensions}" | sort -n | cut -d ' ' -f 2-); do

Running the script on images, for example, can take a bit of time.
Is there a way to parallelize the work? I've looked into parallel and xargs -P, but I don't think they work the way I need.
I assume that using the -name switch is faster than the egrep; is it possible to make the file-name match case insensitive?

Thanks,

Dagg.

DaggyStyle (Watchman)
Posted: Thu Aug 09, 2018 2:38 pm

Apparently -iname does what I need; I thought -iname related to inodes. So strike the second question.
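For the record, combining several case-insensitive patterns looks something like this (untested):
Code:
find "${sub_target_folder}" -type f \( -iname '*.jpg' -o -iname '*.png' \)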

khayyam (Watchman; joined 07 Jun 2012; 6227 posts; Location: Room 101)
Posted: Thu Aug 09, 2018 8:26 pm    Post subject: Re: optimising find handling in bash script

DaggyStyle wrote:
Code:
for file in $(find path1 path2 path3 -printf "%T@ %p\n" 2>/dev/null | grep -v ${sub_target_folder} | egrep -i "${extensions}" | sort -n | cut -d ' ' -f 2-); do

DaggyStyle ... this could probably be improved in a number of ways. First, you could avoid the 'grep -v' by pruning the path:

Code:
find path1 path2 path3 -path "$sub_target_folder" -prune -o -printf "%T@ %p\n"

Secondly, I suspect that the 'egrep -i "$extensions"' is for selecting file extensions; if that's the case then '-regextype posix-extended -type f -regex '.*\.(jpeg|mp4|doc)'' or similar could be used in its place (though by the sound of "running the script on images" you're passing "$extensions" to the script as a variable, and so would need to translate that into a regex).
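Put together with the prune above, that might look something like the following (untested sketch):

Code:
find path1 path2 path3 -path "$sub_target_folder" -prune -o \
  -type f -regextype posix-extended -regex '.*\.(jpeg|mp4|doc)' -printf "%T@ %p\n"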

Probably best you post the whole script as others can probably make similar suggestions.

best ... khay

toralf (Developer; joined 01 Feb 2004; 3922 posts; Location: Hamburg)
Posted: Thu Aug 09, 2018 8:43 pm

These days I prefer something like
Code:
find ... | while read f; do ... done
or
Code:
while read f; do ... done < <(find ...)
to avoid hitting the maximum command-line length.
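If the file names may contain spaces or newlines, the null-delimited variant of the same idea is safer, e.g.:
Code:
while IFS= read -r -d '' f; do
    sha256sum "$f"
done < <(find "$sub_target_folder" -type f -print0)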

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:20 am    Post subject: Re: optimising find handling in bash script

khayyam wrote:
[...] '-regextype posix-extended -type f -regex '.*\.(jpeg|mp4|doc)'' or similar could be used in its place [...]

does the regex support case insensitivity?

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:22 am

toralf wrote:
These days I prefer something like
Code:
find ... | while read f; do ... done
or
Code:
while read f; do ... done < <(find ...)
to avoid hitting the maximum command-line length.

good idea, will change.

thanks.

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:28 am

I'm not using ${extensions} any more; I've replaced it with this:
Code:
searches="$(echo -n "-iname \"*."; echo "$*" | tr '\n' ' ' | sed 's/ /" -o -iname \"*./g'; echo "\"")"

this replaces the egrep.
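For example, calling it with 'mkv mov', ${searches} expands (if I read it right) to the following, including a stray last term caused by the trailing newline from echo:
Code:
-iname "*.mkv" -o -iname "*.mov" -o -iname "*."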

khayyam (Watchman)
Posted: Fri Aug 10, 2018 9:57 am    Post subject: Re: optimising find handling in bash script

DaggyStyle wrote:
does the regex support case insensitivity?

DaggyStyle ... yes, with '-iregex'.

Code:
-regextype posix-extended -iregex '.*\.(jpeg|mp4|doc)'

DaggyStyle wrote:
I'm not using the extensions anymore, I've replaced it with this:
Code:
searches="$(echo -n "-iname \"*."; echo "$*" | tr '\n' ' ' | sed 's/ /" -o -iname \"*./g'; echo "\"")"

this replaces the egrep.

I really don't understand what you're trying to do here ... note that the trailing newline from echo leaves a stray '-o -iname "*."' term at the end of the expansion.

best ... khay

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 4:38 pm

this is the current version:
Code:

#!/bin/bash

LIST_FN="/tmp/list"
TARGET_FOLDER="/mnt/storage/personal"
REMOVE_ORIGIN=0
FILES_ADDED=0
FILES_PROCESSED=0
DEFAULT_USER="dagg"
DEFAULT_GROUP="users"

function organize_media {
   local category="$1"
   local extensions
   local sub_target_folder="${TARGET_FOLDER}/${category}"
   local cmd

   shift
   searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"

   rm -rf ${LIST_FN}
   touch ${LIST_FN}

    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})

   cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

   echo "${category}: comparing files, please wait..."
   while read file; do
      if [ ! -f "${file}" ]; then
         continue
      fi

      FILES_PROCESSED=$((FILES_PROCESSED+1))
      sum=$(sha256sum "${file}" | awk '{print $1}')
      grep -q ${sum} ${LIST_FN}
      if [ $? -eq 1 ]; then
         echo "placing ${file} to it's proper location at ${sub_target_folder}"

         target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"
         mkdir -p "$(dirname "${target}")"
         chown ${DEFAULT_USER}:${DEFAULT_GROUP} "$(dirname "${target}")"

         if [ -f "${target}" ]; then
            echo "file ${file} (${sum}) exists at ${target} ($(sha256sum "${target}" | awk '{print $1}')), exiting."
            exit 1
         fi

         cp -p "${file}" "${target}"
         chown ${DEFAULT_USER}:${DEFAULT_GROUP} "${target}"
         echo ${sum} >> ${LIST_FN}
         FILES_ADDED=$((FILES_ADDED+1))
      fi

      if [ ${REMOVE_ORIGIN} -eq 1 ]; then
         rm -rf ${file}
      fi
   done < <(eval ${cmd})
}

organize_media videos mkv mov mp4 avi
organize_media pictures jpg png gif jpeg

echo "Summary: ${FILES_PROCESSED} files were processed; ${FILES_ADDED} of them were added to ${TARGET_FOLDER}."


For the ~65k files it scans, it takes 160 minutes without the copying (average case). I thought of using files and semaphores to parallelize the work, as there isn't really an ordering issue within the loops.
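For example, the hashing pass alone could probably be fanned out with GNU parallel, something like (untested):
Code:
find "${sub_target_folder}" -type f -print0 | parallel -0 sha256sum {} | awk '{print $1}' >> "${LIST_FN}"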

Anon-E-moose (Watchman; joined 23 May 2008; 6098 posts; Location: Dallas area)
Posted: Fri Aug 10, 2018 5:02 pm

I'm curious, have you tried coding it in perl or python to see if it will speed up significantly?

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 5:49 pm

Anon-E-moose wrote:
I'm curious, have you tried coding it in perl or python to see if it will speed up significantly?

Nope. Given my limited time to work on this I chose bash: I'm not that proficient in python, and as I see it, implementing it in perl would take too long.

Hu (Moderator; joined 06 Mar 2007; 21633 posts)
Posted: Sat Aug 11, 2018 4:52 am

DaggyStyle wrote:
Code:
LIST_FN="/tmp/list"
Predictable file names in /tmp are often a security problem. On principle, I recommend avoiding them.
DaggyStyle wrote:
Code:
searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"
This could be much simpler. See below.
DaggyStyle wrote:
Code:
rm -rf ${LIST_FN}
No need for -rf here.
DaggyStyle wrote:
Code:
touch ${LIST_FN}
No need for this, if you ensure the file is written at least once. See below.
DaggyStyle wrote:
Code:
    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})
Why use eval here? There are some rare legitimate uses for it, but in my opinion, any time you use eval, you need a good reason to overcome the risks associated with using it improperly. I would rewrite this section as:
Code:
   local searches=( -iname "*.$1" )
   shift
   while [[ -n "$1" ]]; do
      searches+=( -o -iname "*.$1" )
      shift
   done
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r sha256sum | gawk '{print $1;}' > "$LIST_FN"
This also gives you a chance to make the sha256sum parallel by changing the invocation of xargs. I find it a little odd that you compute the sums, then discard the associated names.
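For the parallel variant, something like this should work (a sketch; the -P4/-n32 values are arbitrary, and the output order then becomes unspecified):
Code:
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r -P4 -n32 sha256sum | gawk '{print $1;}' > "$LIST_FN"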

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 6:30 am

Hu wrote:
DaggyStyle wrote:
Code:
LIST_FN="/tmp/list"
Predictable file names in /tmp are often a security problem. On principle, I recommend avoiding them.

only one user (root) will run this script, and it will run at most twice a day; I think a predictable file name is OK in this case.
Hu wrote:
DaggyStyle wrote:
Code:
searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"
This could be much simpler. See below.

it seems this line can always be optimized :)
Hu wrote:
DaggyStyle wrote:
Code:
rm -rf ${LIST_FN}
No need for -rf here.

the -r is indeed not needed here, thanks.
Hu wrote:
DaggyStyle wrote:
Code:
touch ${LIST_FN}
No need for this, if you ensure the file is written at least once. See below.

it is simpler to create an empty file than to test whether it exists every time before accessing it.
Hu wrote:
DaggyStyle wrote:
Code:
    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})
Why use eval here?

running ${cmd} directly resulted in a find error, because the quotes inside ${searches} are passed to find literally; eval sorted that out.
Hu wrote:
There are some rare legitimate uses for it, but in my opinion, any time you use eval, you need a good reason to overcome the risks associated with using it improperly. I would rewrite this section as:
Code:
   local searches=( -iname "*.$1" )
   shift
   while [[ -n "$1" ]]; do
      searches+=( -o -iname "*.$1" )
      shift
   done
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r sha256sum | gawk '{print $1;}' > "$LIST_FN"
This also gives you a chance to make the sha256sum parallel by changing the invocation of xargs.

I've tried xargs before, but it always ended after n processes; I've never been able to get it to read all the input while using multiple processes.
Hu wrote:
I find it a little odd that you compute the sums, then discard the associated names.

that's because I can have duplicated file content, e.g. the same content under a different name; saving just the sum is what I need.

thanks for the help, I'll look into it.

khayyam (Watchman)
Posted: Sat Aug 11, 2018 9:34 am

DaggyStyle wrote:
this is the current version:

DaggyStyle ... now we're getting somewhere :) First tentative thoughts: why isn't rsync suitable here? I imagine there are various wrappers/scripts around rsync that do this sort of comparison and merging. Anyhow ... in addition to Hu's comments:

DaggyStyle wrote:
Code:
    sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}

I really don't think sha256sum is needed here; you don't need cryptographic security, you need a fast, unique hash for comparison. Using something like dev-libs/xxhash is probably a good option, but even md5 or sha1 would save you cpu cycles and achieve the same result. Also, you should at minimum use app-crypt/rhash here, as it provides printf output formatting and so you can get rid of that 'awk {print $1}', eg:

Code:
% rhash --printf="%h\n" file
0133f131b8dc2ad015e7c4c331d6c28a2edda6ef

... '%h' is sha1, but rhash supports a variety of hashes, see the OUTPUT FORMAT OPTIONS in 'man rhash'.
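In your script that would make the hashing line something like (untested):

Code:
rhash --printf="%h\n" "${file}" >> ${LIST_FN}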

DaggyStyle wrote:
Code:
    cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

I wonder why you've chosen to define TARGET_FOLDER but then hardcode '~joe/'

DaggyStyle wrote:
Code:
    while read file; do
        if [ ! -f "${file}" ]; then
            continue
        fi

What are you expecting to happen here? ... the while loop is iterating over "file", and we can assume these exist (otherwise why is 'find' providing them).

DaggyStyle wrote:
Code:
    target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"

Looks like you should be using parameter expansion here, rather than 'echo |sed', eg:

Code:
% file="/usr/bin/awk" ; echo ${file##*/}
awk
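
... and closer to what your sed is doing (stripping everything up to and including '/storage/'), eg:

Code:
% file="/mnt/storage/foo/bar.jpg" ; echo "${file#*/storage/}"
foo/bar.jpg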

HTH & best ... khay

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 10:30 am

khayyam wrote:
DaggyStyle wrote:
this is the current version:

DaggyStyle ... now we're getting somewhere :) First tentative thoughts: why isn't rsync suitable here? I imagine there are various wrappers/scripts around rsync that do this sort of comparison and merging. Anyhow ... in addition to Hu's comments:

I don't think rsync handles deduplication.
I've learnt over my years with Linux that most of the time there is no good match between the features I need and what's available, so I decide up front to implement things myself.
khayyam wrote:
DaggyStyle wrote:
Code:
    sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}

I really don't think sha256sum is needed here; you don't need cryptographic security, you need a fast, unique hash for comparison. Using something like dev-libs/xxhash is probably a good option, but even md5 or sha1 would save you cpu cycles and achieve the same result. Also, you should at minimum use app-crypt/rhash here, as it provides printf output formatting and so you can get rid of that 'awk {print $1}', eg:

Code:
% rhash --printf="%h\n" file
0133f131b8dc2ad015e7c4c331d6c28a2edda6ef

... '%h' is sha1, but rhash supports a variety of hashes, see the OUTPUT FORMAT OPTIONS in 'man rhash'.

I'm used to working with md5sum, but md5 is more collision-prone, so I went with sha256. If rhash is enough, I'll take it.
khayyam wrote:
DaggyStyle wrote:
Code:
    cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

I wonder why you've chosen to define TARGET_FOLDER but then hardcode '~joe/'

Moving the source paths into definitions is an optimization; I want to get it working properly first.
khayyam wrote:
DaggyStyle wrote:
Code:
    while read file; do
        if [ ! -f "${file}" ]; then
            continue
        fi

What are you expecting to happen here? ... the while loop is iterating over "file", and we can assume these exist (otherwise why is 'find' providing them).

most of the file names contain unicode, spaces and non-English characters. I've hit a situation where the path came out bad, resulting in a non-existent file; since the folder in question is not important, I've added a temporary workaround to skip such files, as the bad path is hard to pinpoint.
khayyam wrote:
DaggyStyle wrote:
Code:
    target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"

Looks like you should be using parameter expansion here, rather than 'echo |sed', eg:

Code:
% file="/usr/bin/awk" ; echo ${file##*/}
awk

HTH & best ... khay

looks like all I need is to remove the first 2-3 path components, so I guess cut can do the job too. Thanks for the tip.

mv (Watchman; joined 20 Apr 2005; 6747 posts)
Posted: Sat Aug 11, 2018 10:43 am

You might want to have a look at patchdirs (contained in dev-util/mv_perl from the mv overlay).

szatox (Advocate; joined 27 Aug 2013; 3136 posts)
Posted: Sat Aug 11, 2018 12:06 pm

Quote:
I don't think rsync handles deduplication.
It does support file-level deduplication with --link-dest.
AFAIK it doesn't support block-level deduplication. However, if you have a FS that supports copy-on-write, you can create a linked copy and then update it with rsync using the --inplace flag. Besides overwriting the destination file directly instead of creating a temporary file and renaming it, this flag enables rsync's delta algorithm (as long as you run it over the network), which will discover differences and only update changed blocks.
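A minimal sketch of the file-level case (the paths are only illustrative; files unchanged relative to the reference tree become hardlinks instead of copies):
Code:
rsync -a --link-dest=/path/to/reference /mnt/share/ /mnt/storage/personal/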

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 12:45 pm

szatox wrote:
Quote:
I don't think rsync handles deduplication.
It does support file-level deduplication with --link-dest.
AFAIK it doesn't support block-level deduplication. However, if you have a FS that supports copy-on-write, you can create a linked copy and then update it with rsync using the --inplace flag. Besides overwriting the destination file directly instead of creating a temporary file and renaming it, this flag enables rsync's delta algorithm (as long as you run it over the network), which will discover differences and only update changed blocks.

all my work is done locally; somehow I get the feeling this suggestion is way more than I need. Thanks for the suggestion though.

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 12:47 pm

mv wrote:
You might want to have a look at patchdirs (contained in dev-util/mv_perl from the mv overlay).

from the description, it isn't what I need.
I have many media files (movies and pics) spread over several different folders, and I want to unify them under a single location.
In addition, I want to keep scanning these locations in case new files are added, and automatically move them to the right location.

mv (Watchman)
Posted: Sat Aug 11, 2018 1:21 pm

DaggyStyle wrote:
I want to unify them under a single location.
In addition, I want to keep scanning these locations in case new files are added, and automatically move them to the right location.

It is not so clear what you mean by unify (locating dupes and hard-linking them?), but except for the moving itself, I think most things are covered. Instead of moving, you would just get as output a list of "missing" files in the new location. However, I agree that it is hard to see how patchdirs can be used for such "nonstandard" situations.
Anyway, for such things, where you need hashes and sooner or later trickier structures, I would always recommend perl over "find": with perl's File::Find you have much finer control over pruning, loops caused by symbolic links, etc., and no problems with filenames containing embedded newlines. (With the standard find utility you are forced to use -exec or things like quoter, and you still have problems with argument lengths, or with pipes running in a separate process and thus being unable to export variables, etc.)

DaggyStyle (Watchman)
Posted: Sat Nov 10, 2018 9:11 am

Back to this topic: I've taken the python path, and this is what I have:
Code:
#!/usr/bin/env python

import os, multiprocessing, threading, fnmatch, hashlib, timeit, time
from typing import NamedTuple
from shutil import copyfile
from queue import *

BLOCKSIZE = 65536;
DEBUG = 0;
MOVE_FILE = 0;

jobs = int(multiprocessing.cpu_count() * 3 / 4) - 1;
root_path = "/mnt/storage/personal";
src_roots = [ os.path.expanduser('~user1/move to storage'), '/mnt/storage', '/mnt/share' ];  # expanduser resolves ~user1; shell-style backslash escapes are not needed in python strings

timeit.template = """
def inner(_it, _timer{init}):
    {setup}
    _t0 = _timer()
    for _i in _it:
        retval = {stmt}
    _t1 = _timer()
    return _t1 - _t0, retval
"""

def wrapper(func, *args, **kwargs):
        def wrapped():
                return func(*args, **kwargs);

        return wrapped;

class cat_set(NamedTuple):
        name: str
        extensions: list

class worker(threading.Thread):
        def __init__(self, tid, cb, source, sink, cat):
                threading.Thread.__init__(self);
                self.tid = tid;
                self.source = source;
                self.sink = sink;
                self.cb = cb;
                self.ret_val = 0;
                self.category = cat;

        def run(self):
                if DEBUG > 0:
                        print("thread " + str(self.tid) + ": working");

                self.ret_val = thread_task(self.tid, self.cb, self.source, self.sink, self.category);

                if DEBUG > 0:
                        print("thread " + str(self.tid) + ": work is done.");

                return self.ret_val;

def sha1_file(file):
        hasher = hashlib.sha1();
        f =  open(file, 'rb');
        buf = f.read(BLOCKSIZE);

        while len(buf) > 0:
                hasher.update(buf);
                buf = f.read(BLOCKSIZE);

        return hasher.hexdigest();

def handle_scanned_file_cb(idx, existing_files_pool, filename, none):
        conflict = 0;
        hash = str(sha1_file(filename));
        if DEBUG > 0:
                print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ")");

        if hash in existing_files_pool:
                print("thread " + str(idx) + ": " + str(filename) + " hash (" + hash + ") is already in list (" + existing_files_pool[hash] + ")");
                conflict = 1;
        else:
                existing_files_pool[hash] = filename;

        return conflict;

def handle_sync_file_cb(idx, existing_files_pool, filename, category):
        hash = str(sha1_file(filename));
        if DEBUG > 0:
                print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ")");

        if not hash in existing_files_pool:
                existing_files_pool[hash] = filename;
                dst = root_path + "/" + str(category) + filename[filename.index("/", filename.index("/", 1) + 1):]

                if DEBUG > 0:
                        print("thread " + str(idx) + ": copying " + str(filename) + " to " + str(dst));

                if not os.path.exists(os.path.dirname(dst)):
                        try:
                                os.makedirs(os.path.dirname(dst));
                        except FileExistsError:
                                print(str(os.path.dirname(dst)) + " exists already, skipping.");

                copyfile(filename, dst);

                if MOVE_FILE:
                        os.remove(filename);
                        dir = os.path.dirname(filename);
                        if not os.listdir(dir):
                                os.rmdir(dir);
        else:
                if str(os.path.basename(filename)) == str(os.path.basename(existing_files_pool[hash])) and MOVE_FILE:
                        os.remove(filename);
                        dir = os.path.dirname(filename);
                        if not os.listdir(dir):
                                os.rmdir(dir);
                else:
                        print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ") which exists in the list as " + str(existing_files_pool[hash]) +", ignoring file.");

        return 0;

def handle_files(idx, cb, source, sink, category):
        conflicts = 0;
        processed_items_count = 0;
        start = time.time();
        process = True;

        while process:
                if DEBUG > 1:
                        print("thread " + str(idx) + ": count = " + str(source.qsize()));
                try:
                        element = source.get_nowait();
                        processed_items_count += 1;
                        if DEBUG > 1:
                                print("thread " + str(idx) + ": is handling");
                        conflicts += cb(idx, sink, element, category);

                except Empty:
                        elapsed = time.time() - start
                        if elapsed > 5 and not processed_items_count:
                                print("thread " + str(idx) + ": no file provided.");
                                process = False;
                        elif processed_items_count:
                                process = False;

        if conflicts:
                print("thread " + str(idx) + ": found " + str(conflicts) + " conflicts.");

        return conflicts;

def thread_task(idx, cb, source, sink, category):
        print("thread " + str(idx) + ": is up");

        status = handle_files(idx, cb, source, sink, category);

        print("thread " + str(idx) + ": is done");

        return status;

def handle_extensions(extensions):
        final_extensions = [];

        for ext in extensions:
                final_extensions.append("*." + ext.lower());
                final_extensions.append("*." + ext.upper());

        return final_extensions;

def get_files_from_subtree(root_folders, extensions):
        for folder in root_folders:
                for dirpath, dirnames, filenames in os.walk(folder):
                        for fn in filenames:
                                if any(fnmatch.fnmatch(fn, w) for w in extensions):
                                        path = os.path.join(dirpath, fn)
                                        stat = os.lstat(os.path.normpath(path))  # lstat fails on some files without normpath
                                        yield stat.st_ctime, path  # Yield file

def sync_src_tree_to_dest(idx, category, extensions, files_queue):
        act_paths = [ ];

        for path in src_roots:
                act_paths.append(os.path.abspath(path));

        for ctime, path in sorted(get_files_from_subtree(act_paths, extensions), reverse=False):
                if not path.startswith(root_path):
                        files_queue.put(path);

def scan_dst_tree(idx, category, extensions, files_queue):
        root_folder = root_path + "/" + category;

        for ctime, path in get_files_from_subtree([ root_folder ], extensions):
                if DEBUG > 1:
                        print("thread " + str(idx) + ": adding " + path + " to queue");
                files_queue.put(path);

def handle_data_action(idx, category, extensions, tree_scan_cb, handle_file_cb, action):
        final_extensions = handle_extensions(extensions);
        threads = set();
        files_queue = Queue();
        files_hash_map = dict();
        ret_val = 0;

        print("thread " + str(idx) + ": " + action + " " + category);
        for i in range(jobs):
                thread = worker(i + 1, handle_file_cb, files_queue, files_hash_map, category);
                threads.add(thread);
                thread.start();

        if DEBUG > 0:
                print("thread " + str(idx) + ": adding files");

        tree_scan_cb(idx, category, final_extensions, files_queue);

        if DEBUG > 1:
                print("thread " + str(idx) + ": count = " + str(files_queue.qsize()));
        if DEBUG > 0:
                print("thread " + str(idx) + ": waiting for queue to deplete");

        while True:
                if not len(threads):
                        break;
                thread = threads.pop();
                print("thread " + str(idx) + ": joining child thread " + str(thread.tid) + " until it finishes.");
                thread.join();
                print("thread " + str(idx) + ": child thread " + str(thread.tid) + " has finished. (ret val is " + str(thread.ret_val) + ")");
                if thread.ret_val:
                        ret_val = 1;

        print("thread " + str(idx) + ": done.");
        print("thread " + str(idx) + ": " + action + " finished.");

        return ret_val;

def handle_category(idx, category, extensions):
        wrapped = wrapper(handle_data_action, idx, category, extensions, scan_dst_tree, handle_scanned_file_cb, "scanning");
        duration, ret_val = timeit.timeit(wrapped, number = 1);
        print('thread {:d}: {:s} scan took {:.3f} seconds'.format(idx, category, duration));

        if not ret_val:
                wrapped = wrapper(handle_data_action, idx, category, extensions, sync_src_tree_to_dest, handle_sync_file_cb, "syncing");
                duration, ret_val = timeit.timeit(wrapped, number = 1);
                print('thread {:d}: {:s} sync took {:.3f} seconds'.format(idx, category, duration));

cats_set = [
        cat_set("videos", [ "mkv", "mov", "mp4", "avi" ]),
        cat_set("pictures", [ "jpg", "png", "gif", "jpeg" ]),
];

for cat in cats_set:
        handle_category(0, cat.name, cat.extensions);


it does what I need; the problem is the time it takes to create the lists, especially the step that calculates the sha1 of each file.
I'll appreciate any input on the code in general, and on how to speed up the info-gathering step.
I know I could maintain a persistent list, but I'd have to validate it each time I run the script, and that amounts to creating it from scratch.
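One way around that (just a sketch, not part of the script above; the cache file location and format are made up) would be to key a cache on path, size and mtime, so only new or changed files get re-hashed:
Code:
import json, os

CACHE_FN = "/tmp/hash_cache.json"  # hypothetical location

def load_cache():
        # returns {} on first run or if the cache is unreadable
        try:
                with open(CACHE_FN) as f:
                        return json.load(f)
        except (IOError, ValueError):
                return {}

def cached_sha1(cache, path):
        # re-hash only if path, size or mtime changed since the last run
        st = os.lstat(path)
        key = "%s|%d|%d" % (path, st.st_size, int(st.st_mtime))
        if key not in cache:
                cache[key] = sha1_file(path)  # sha1_file() from the script above
        return cache[key]

def save_cache(cache):
        with open(CACHE_FN, "w") as f:
                json.dump(cache, f)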

Hu (Moderator)
Posted: Sat Nov 10, 2018 6:15 pm

DaggyStyle wrote:
Code:
BLOCKSIZE = 65536;
No need for the trailing semicolon.
DaggyStyle wrote:
Code:
def wrapper(func, *args, **kwargs):
        def wrapped():
                return func(*args, **kwargs);

        return wrapped;
This body could be rewritten as:
Code:
   return lambda: func(*args, **kwargs)
DaggyStyle wrote:
Code:
def sha1_file(file):
Naming arguments after Python keywords is legal, but produces confusing syntax highlighting in some editors. Avoid it where practical.
DaggyStyle wrote:
Code:
        f =  open(file, 'rb');
        buf = f.read(BLOCKSIZE);

        while len(buf) > 0:
                hasher.update(buf);
                buf = f.read(BLOCKSIZE);
You should use with. It would be better not to repeat yourself, although Python lacks a good way of doing so. You could rewrite this as:
Code:
   with open(filename, 'rb') as f:
      while True:
         buf = f.read(BLOCKSIZE)
         if not buf:
            break
         hasher.update(buf)
You should be able to rewrite it in a way that combines the while condition and the truth test, but that is only possible if you have the very new := assignment or if you fake it by using a dummy iterator.
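With := (Python 3.8 or later), that reads:
Code:
   with open(filename, 'rb') as f:
      while buf := f.read(BLOCKSIZE):
         hasher.update(buf)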
DaggyStyle wrote:
Code:
        conflict = 0;
No need for a single-return function here.
DaggyStyle wrote:
Code:
        hash = str(sha1_file(filename));
sha1_file returns the result of hexdigest(), which is already a string. No need to str it again.
DaggyStyle wrote:
Code:
        if hash in existing_files_pool:
                print("thread " + str(idx) + ": " + str(filename) + " hash (" + hash + ") is already in list (" + existing_files_pool[hash] + ")");
                conflict = 1;
You could read the value back as part of the lookup. Rewrite as:
Code:
   h = existing_files_pool.get(hash)
   if h is not None:
      print("thread %s: %s hash (%s) is already in list (%s)" % (idx, filename, hash, h))
      return 1
DaggyStyle wrote:
Code:
def sync_src_tree_to_dest(idx, category, extensions, files_queue):
        act_paths = [ ];

        for path in src_roots:
                act_paths.append(os.path.abspath(path));
Code:
   act_paths = [os.path.abspath(path) for path in src_roots]
DaggyStyle wrote:
Code:
        while True:
                if not len(threads):
                        break;
Code:
while threads:
DaggyStyle wrote:
Code:
cats_set = [
        cat_set("videos", [ "mkv", "mov", "mp4", "avi" ]),
        cat_set("pictures", [ "jpg", "png", "gif", "jpeg" ]),
];
This may return pictures that are not of cats.
DaggyStyle wrote:
it does what I need; the problem is the time it takes to create the lists, especially the step that calculates the sha1 of each file.
I'll appreciate any input on the code in general, and on how to speed up the info-gathering step.
I know I could maintain a persistent list, but I'd have to validate it each time I run the script, and that amounts to creating it from scratch.
As a first step, check that you never digest the same file more than once. Your multi-level callback design makes it hard to say quickly whether that could happen. If it did, fixing it would be an easy performance win.

DaggyStyle (Watchman)
Posted: Fri Nov 16, 2018 3:01 pm

Hu wrote:
[...]
As a first step, check that you never digest the same file more than once. Your multi-level callback design makes it hard to say quickly whether that could happen. If it did, fixing it would be an easy performance win.


the logic is rather simple:

  1. scan the target folder for the relevant files and digest each of them.
  2. scan the source folders; for each relevant file, if its digest doesn't exist in the target folder's digest list, move it to the target folder and add it to the digest list.

the digestion takes time; maybe the dict is the issue, if it isn't hashed or something.

Hu (Moderator)
Posted: Fri Nov 16, 2018 11:12 pm

I understood the intent to be as you describe. My point was that I can't tell from the code whether it actually works like that unless I trace through multiple levels of callback. If it doesn't work like you say it should, that could easily cost substantial performance.

Python dictionaries are optimized, because they are used extensively. Even if they were not, the performance cost of the most naive implementation of a dictionary would likely be tiny compared to the performance cost of hashing large numbers of large files.

DaggyStyle (Watchman)
Posted: Sat Nov 17, 2018 6:21 am

Hu wrote:
I understood the intent to be as you describe. My point was that I can't tell from the code whether it actually works like that unless I trace through multiple levels of callback. If it doesn't work like you say it should, that could easily cost substantial performance.

Python dictionaries are optimized, because they are used extensively. Even if they were not, the performance cost of the most naive implementation of a dictionary would likely be tiny compared to the performance cost of hashing large numbers of large files.

OK, so to better pinpoint the issue, I should remove the callbacks?

Page 1 of 2