Gentoo Forums :: Portage & Programming

optimising find handling in bash script

DaggyStyle (Watchman; joined 22 Mar 2006; 5909 posts)
Posted: Thu Aug 09, 2018 2:37 pm    Post subject: optimising find handling in bash script

Greetings,

I have a script that walks a file tree, selects specific files and builds a list of checksums from them; it then walks a few other file trees and adds any files whose sums are not in that list to the tree it scanned first.
For the first pass I use this loop:
Code:
for file in $(find ${sub_target_folder} | egrep  -i "${extensions}"); do ...

and for the second loop, I use this find:
Code:
for file in $(find path1 path2 path3 -printf "%T@ %p\n" 2>/dev/null | grep -v ${sub_target_folder} | egrep -i "${extensions}" | sort -n | cut -d ' ' -f 2-); do

Running the script on images, for example, can take a bit of time.
Is there a way to parallelize the work? I've looked into parallel and xargs -P, but I don't think they work the way I need.
I assume that using the -name switch is faster than the egrep; is it possible to make the file-name match case insensitive?

Thanks,

Dagg.

DaggyStyle (Watchman)
Posted: Thu Aug 09, 2018 2:38 pm

Apparently -iname does what I need; I thought -iname related to inodes. So strike the second question.
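For the record, combining several case-insensitive patterns looks something like this (untested):
Code:
find "${sub_target_folder}" -type f \( -iname '*.jpg' -o -iname '*.png' \)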

khayyam (Watchman; joined 07 Jun 2012; 6227 posts; Location: Room 101)
Posted: Thu Aug 09, 2018 8:26 pm    Post subject: Re: optimising find handling in bash script

DaggyStyle wrote:
Code:
for file in $(find path1 path2 path3 -printf "%T@ %p\n" 2>/dev/null | grep -v ${sub_target_folder} | egrep -i "${extensions}" | sort -n | cut -d ' ' -f 2-); do

DaggyStyle ... this could probably be improved in a number of ways. First, you could avoid the 'grep -v' by pruning the path:

Code:
find path1 path2 path3 -path "$sub_target_folder" -prune -o -printf "%T@ %p\n"

Secondly, I suspect that the 'egrep -i "$extensions"' is for selecting file extensions; if that's the case then '-regextype posix-extended -type f -regex '.*\.(jpeg|mp4|doc)'' or similar could be used in its place (though by the sound of "running the script on images" you're passing "$extensions" to the script as a variable, and so would need to translate that into a regex).
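Put together with the prune above, that might look something like the following (untested sketch):

Code:
find path1 path2 path3 -path "$sub_target_folder" -prune -o \
  -type f -regextype posix-extended -regex '.*\.(jpeg|mp4|doc)' -printf "%T@ %p\n"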

Probably best you post the whole script as others can probably make similar suggestions.

best ... khay

toralf (Developer; joined 01 Feb 2004; 3922 posts; Location: Hamburg)
Posted: Thu Aug 09, 2018 8:43 pm

These days I prefer something like
Code:
find ... | while read f; do ... done
or
Code:
while read f; do ... done < <(find ...)
to avoid hitting the maximum command-line length.
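If the file names may contain spaces or newlines, the null-delimited variant of the same idea is safer, e.g.:
Code:
while IFS= read -r -d '' f; do
    sha256sum "$f"
done < <(find "$sub_target_folder" -type f -print0)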

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:20 am    Post subject: Re: optimising find handling in bash script

khayyam wrote:
[...] '-regextype posix-extended -type f -regex '.*\.(jpeg|mp4|doc)'' or similar could be used in its place [...]

does the regex support case insensitivity?

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:22 am

toralf wrote:
These days I prefer something like
Code:
find ... | while read f; do ... done
or
Code:
while read f; do ... done < <(find ...)
to avoid hitting the maximum command-line length.

good idea, will change.

thanks.

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 6:28 am

I'm not using ${extensions} any more; I've replaced it with this:
Code:
searches="$(echo -n "-iname \"*."; echo "$*" | tr '\n' ' ' | sed 's/ /" -o -iname \"*./g'; echo "\"")"

this replaces the egrep.
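For example, calling it with 'mkv mov', ${searches} expands (if I read it right) to the following, including a stray last term caused by the trailing newline from echo:
Code:
-iname "*.mkv" -o -iname "*.mov" -o -iname "*."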

khayyam (Watchman)
Posted: Fri Aug 10, 2018 9:57 am    Post subject: Re: optimising find handling in bash script

DaggyStyle wrote:
does the regex support case insensitivity?

DaggyStyle ... yes, with '-iregex'.

Code:
-regextype posix-extended -iregex '.*\.(jpeg|mp4|doc)'

DaggyStyle wrote:
I'm not using the extensions anymore, I've replaced it with this:
Code:
searches="$(echo -n "-iname \"*."; echo "$*" | tr '\n' ' ' | sed 's/ /" -o -iname \"*./g'; echo "\"")"

this replaces the egrep.

I really don't understand what you're trying to do here ... note that the trailing newline from echo leaves a stray '-o -iname "*."' term at the end of the expansion.

best ... khay

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 4:38 pm

this is the current version:
Code:

#!/bin/bash

LIST_FN="/tmp/list"
TARGET_FOLDER="/mnt/storage/personal"
REMOVE_ORIGIN=0
FILES_ADDED=0
FILES_PROCESSED=0
DEFAULT_USER="dagg"
DEFAULT_GROUP="users"

function organize_media {
   local category="$1"
   local extensions
   local sub_target_folder="${TARGET_FOLDER}/${category}"
   local cmd

   shift
   searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"

   rm -rf ${LIST_FN}
   touch ${LIST_FN}

    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})

   cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

   echo "${category}: comparing files, please wait..."
   while read file; do
      if [ ! -f "${file}" ]; then
         continue
      fi

      FILES_PROCESSED=$((FILES_PROCESSED+1))
      sum=$(sha256sum "${file}" | awk '{print $1}')
      grep -q ${sum} ${LIST_FN}
      if [ $? -eq 1 ]; then
         echo "placing ${file} to it's proper location at ${sub_target_folder}"

         target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"
         mkdir -p "$(dirname "${target}")"
         chown ${DEFAULT_USER}:${DEFAULT_GROUP} "$(dirname "${target}")"

         if [ -f "${target}" ]; then
            echo "file ${file} (${sum}) exists at ${target} ($(sha256sum "${target}" | awk '{print $1}')), exiting."
            exit 1
         fi

         cp -p "${file}" "${target}"
         chown ${DEFAULT_USER}:${DEFAULT_GROUP} "${target}"
         echo ${sum} >> ${LIST_FN}
         FILES_ADDED=$((FILES_ADDED+1))
      fi

      if [ ${REMOVE_ORIGIN} -eq 1 ]; then
         rm -rf ${file}
      fi
   done < <(eval ${cmd})
}

organize_media videos mkv mov mp4 avi
organize_media pictures jpg png gif jpeg

echo "Summary: ${FILES_PROCESSED} files were processed; ${FILES_ADDED} of them were added to ${TARGET_FOLDER}."


For the ~65k files it scans, it takes 160 minutes without the copying (average case). I thought of using files and semaphores to parallelize the work, as there isn't really an ordering issue within the loops.
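For example, the hashing pass alone could probably be fanned out with GNU parallel, something like (untested):
Code:
find "${sub_target_folder}" -type f -print0 | parallel -0 sha256sum {} | awk '{print $1}' >> "${LIST_FN}"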

Anon-E-moose (Watchman; joined 23 May 2008; 6098 posts; Location: Dallas area)
Posted: Fri Aug 10, 2018 5:02 pm

I'm curious, have you tried coding it in perl or python to see if it will speed up significantly?

DaggyStyle (Watchman)
Posted: Fri Aug 10, 2018 5:49 pm

Anon-E-moose wrote:
I'm curious, have you tried coding it in perl or python to see if it will speed up significantly?

Nope. Given my limited time to work on this I chose bash: I'm not that proficient in python, and as I see it, implementing it in perl would take too long.

Hu (Moderator; joined 06 Mar 2007; 21633 posts)
Posted: Sat Aug 11, 2018 4:52 am

DaggyStyle wrote:
Code:
LIST_FN="/tmp/list"
Predictable file names in /tmp are often a security problem. On principle, I recommend avoiding them.
DaggyStyle wrote:
Code:
searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"
This could be much simpler. See below.
DaggyStyle wrote:
Code:
rm -rf ${LIST_FN}
No need for -rf here.
DaggyStyle wrote:
Code:
touch ${LIST_FN}
No need for this, if you ensure the file is written at least once. See below.
DaggyStyle wrote:
Code:
    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})
Why use eval here? There are some rare legitimate uses for it, but in my opinion, any time you use eval, you need a good reason to overcome the risks associated with using it improperly. I would rewrite this section as:
Code:
   local searches=( -iname "*.$1" )
   shift
   while [[ -n "$1" ]]; do
      searches+=( -o -iname "*.$1" )
      shift
   done
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r sha256sum | gawk '{print $1;}' > "$LIST_FN"
This also gives you a chance to make the sha256sum parallel by changing the invocation of xargs. I find it a little odd that you compute the sums, then discard the associated names.
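For the parallel variant, something like this should work (a sketch; the -P4/-n32 values are arbitrary, and the output order then becomes unspecified):
Code:
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r -P4 -n32 sha256sum | gawk '{print $1;}' > "$LIST_FN"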

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 6:30 am

Hu wrote:
DaggyStyle wrote:
Code:
LIST_FN="/tmp/list"
Predictable file names in /tmp are often a security problem. On principle, I recommend avoiding them.

only one user (root) will run this script, and it will run at most twice a day; I think a predictable file name is OK in this case.
Hu wrote:
DaggyStyle wrote:
Code:
searches="$(echo -n "-regextype posix-extended -iregex '.*\.("; echo "$*" | xargs | sed 's/ /|/g' | tr -d '\n'; echo ")'")"
This could be much simpler. See below.

it seems this line can always be optimized :)
Hu wrote:
DaggyStyle wrote:
Code:
rm -rf ${LIST_FN}
No need for -rf here.

the -r is indeed not needed here, thanks.
Hu wrote:
DaggyStyle wrote:
Code:
touch ${LIST_FN}
No need for this, if you ensure the file is written at least once. See below.

it is simpler to create an empty file than to test whether it exists every time before accessing it.
Hu wrote:
DaggyStyle wrote:
Code:
    echo "${category}: compiling existing files list, please wait..."
    cmd="find ${sub_target_folder} -type f "${searches}""
   while read file; do
      sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}
   done < <(eval ${cmd})
Why use eval here?

running ${cmd} directly resulted in a find error, because the quotes inside ${searches} are passed to find literally; eval sorted that out.
Hu wrote:
There are some rare legitimate uses for it, but in my opinion, any time you use eval, you need a good reason to overcome the risks associated with using it improperly. I would rewrite this section as:
Code:
   local searches=( -iname "*.$1" )
   shift
   while [[ -n "$1" ]]; do
      searches+=( -o -iname "*.$1" )
      shift
   done
   find "$sub_target_folder" -type f \( "${searches[@]}" \) -print0 | xargs -0r sha256sum | gawk '{print $1;}' > "$LIST_FN"
This also gives you a chance to make the sha256sum parallel by changing the invocation of xargs.

I've tried xargs before, but it always ended after n processes; I've never been able to get it to read all the input while using multiple processes.
Hu wrote:
I find it a little odd that you compute the sums, then discard the associated names.

that's because I can have duplicated file content, e.g. the same content under a different name; saving just the sum is what I need.

thanks for the help, I'll look into it.

khayyam (Watchman)
Posted: Sat Aug 11, 2018 9:34 am

DaggyStyle wrote:
this is the current version:

DaggyStyle ... now we're getting somewhere :) First tentative thoughts: why isn't rsync suitable here? I imagine there are various wrappers/scripts around rsync that do this sort of comparison and merging. Anyhow ... in addition to Hu's comments:

DaggyStyle wrote:
Code:
    sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}

I really don't think sha256sum is needed here; you don't need cryptographic security, you need a fast, unique hash for comparison. Using something like dev-libs/xxhash is probably a good option, but even md5 or sha1 would save you cpu cycles and achieve the same result. Also, you should at minimum use app-crypt/rhash here, as it provides printf output formatting and so you can get rid of that 'awk {print $1}', eg:

Code:
% rhash --printf="%h\n" file
0133f131b8dc2ad015e7c4c331d6c28a2edda6ef

... '%h' is sha1, but rhash supports a variety of hashes, see the OUTPUT FORMAT OPTIONS in 'man rhash'.
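In your script that would make the hashing line something like (untested):

Code:
rhash --printf="%h\n" "${file}" >> ${LIST_FN}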

DaggyStyle wrote:
Code:
    cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

I wonder why you've chosen to define TARGET_FOLDER but then hardcode '~joe/'

DaggyStyle wrote:
Code:
    while read file; do
        if [ ! -f "${file}" ]; then
            continue
        fi

What are you expecting to happen here? ... the while loop is iterating over "file", and we can assume these exist (otherwise why is 'find' providing them).

DaggyStyle wrote:
Code:
    target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"

Looks like you should be using parameter expansion here, rather than 'echo |sed', eg:

Code:
% file="/usr/bin/awk" ; echo ${file##*/}
awk
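
... and closer to what your sed is doing (stripping everything up to and including '/storage/'), eg:

Code:
% file="/mnt/storage/foo/bar.jpg" ; echo "${file#*/storage/}"
foo/bar.jpg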

HTH & best ... khay

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 10:30 am

khayyam wrote:
DaggyStyle wrote:
this is the current version:

DaggyStyle ... now we're getting somewhere :) First tentative thoughts: why isn't rsync suitable here? I imagine there are various wrappers/scripts around rsync that do this sort of comparison and merging. Anyhow ... in addition to Hu's comments:

I don't think rsync handles deduplication.
I've learnt over my years with Linux that most of the time there is no good match between the features I need and what's available, so I decide up front to implement things myself.
khayyam wrote:
DaggyStyle wrote:
Code:
    sha256sum "${file}" | awk '{print $1}' >> ${LIST_FN}

I really don't think sha256sum is needed here; you don't need cryptographic security, you need a fast, unique hash for comparison. Using something like dev-libs/xxhash is probably a good option, but even md5 or sha1 would save you cpu cycles and achieve the same result. Also, you should at minimum use app-crypt/rhash here, as it provides printf output formatting and so you can get rid of that 'awk {print $1}', eg:

Code:
% rhash --printf="%h\n" file
0133f131b8dc2ad015e7c4c331d6c28a2edda6ef

... '%h' is sha1, but rhash supports a variety of hashes, see the OUTPUT FORMAT OPTIONS in 'man rhash'.

I'm used to working with md5sum, but md5 is more collision-prone, so I went with sha256. If rhash is enough, I'll take it.
khayyam wrote:
DaggyStyle wrote:
Code:
    cmd="find $(readlink -f ~joe/move\ to\ storage) /mnt/storage /mnt/share -path "${sub_target_folder}" -prune -o -type f "${searches}" -printf \"%T@ %p\n\" 2>/dev/null | sort -n | cut -d \" \" -f 2-"

I wonder why you've chosen to define TARGET_FOLDER but then hardcode '~joe/'

Moving the source paths into definitions is an optimization; I want to get it working properly first.
khayyam wrote:
DaggyStyle wrote:
Code:
    while read file; do
        if [ ! -f "${file}" ]; then
            continue
        fi

What are you expecting to happen here? ... the while loop is iterating over "file", and we can assume these exist (otherwise why is 'find' providing them).

most of the file names contain unicode, spaces and non-English characters. I've hit a situation where the path came out bad, resulting in a non-existent file; since the folder in question is not important, I've added a temporary workaround to skip such files, as the bad path is hard to pinpoint.
khayyam wrote:
DaggyStyle wrote:
Code:
    target="${sub_target_folder}/$(echo "${file}" | sed 's/^.*\/storage\///g;s/^.*\/share\///g')"

Looks like you should be using parameter expansion here, rather than 'echo |sed', eg:

Code:
% file="/usr/bin/awk" ; echo ${file##*/}
awk

HTH & best ... khay

looks like all I need is to remove the first 2-3 path components, so I guess cut can do the job too. Thanks for the tip.

mv (Watchman; joined 20 Apr 2005; 6747 posts)
Posted: Sat Aug 11, 2018 10:43 am

You might want to have a look at patchdirs (contained in dev-util/mv_perl from the mv overlay).

szatox (Advocate; joined 27 Aug 2013; 3136 posts)
Posted: Sat Aug 11, 2018 12:06 pm

Quote:
I don't think rsync handles deduplication.
It does support file-level deduplication with --link-dest.
AFAIK it doesn't support block-level deduplication. However, if you have a FS that supports copy-on-write, you can create a linked copy and then update it with rsync using the --inplace flag. Besides overwriting the destination file directly instead of creating a temporary file and renaming it, this flag enables rsync's delta algorithm (as long as you run it over the network), which will discover differences and only update changed blocks.
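A minimal sketch of the file-level case (the paths are only illustrative; files unchanged relative to the reference tree become hardlinks instead of copies):
Code:
rsync -a --link-dest=/path/to/reference /mnt/share/ /mnt/storage/personal/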

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 12:45 pm

szatox wrote:
Quote:
I don't think rsync handles deduplication.
It does support file-level deduplication with --link-dest.
AFAIK it doesn't support block-level deduplication. However, if you have a FS that supports copy-on-write, you can create a linked copy and then update it with rsync using the --inplace flag. Besides overwriting the destination file directly instead of creating a temporary file and renaming it, this flag enables rsync's delta algorithm (as long as you run it over the network), which will discover differences and only update changed blocks.

all my work is done locally; somehow I get the feeling this suggestion is way more than I need. Thanks for the suggestion though.

DaggyStyle (Watchman)
Posted: Sat Aug 11, 2018 12:47 pm

mv wrote:
You might want to have a look at patchdirs (contained in dev-util/mv_perl from the mv overlay).

from the description, it isn't what I need.
I have many media files (movies and pics) spread over several different folders, and I want to unify them under a single location.
In addition, I want to keep scanning these locations in case new files are added, and automatically move them to the right location.

mv (Watchman)
Posted: Sat Aug 11, 2018 1:21 pm

DaggyStyle wrote:
I want to unify them under a single location.
In addition, I want to keep scanning these locations in case new files are added, and automatically move them to the right location.

It is not so clear what you mean by unify (locating dupes and hard-linking them?), but except for the moving itself, I think most things are covered. Instead of moving, you would just get as output a list of "missing" files in the new location. However, I agree that it is hard to see how patchdirs can be used for such "nonstandard" situations.
Anyway, for such things, where you need hashes and sooner or later trickier structures, I would always recommend perl over "find": with perl's File::Find you have much finer control over pruning, loops caused by symbolic links, etc., and no problems with filenames containing embedded newlines. (With the standard find utility you are forced to use -exec or things like quoter, and you still have problems with argument lengths, or with pipes running in a separate process and thus being unable to export variables, etc.)

DaggyStyle (Watchman)
Posted: Sat Nov 10, 2018 9:11 am

Back to this topic: I've taken the python path, and this is what I have:
Code:
#!/usr/bin/env python

import os, multiprocessing, threading, fnmatch, hashlib, timeit, time
from typing import NamedTuple
from shutil import copyfile
from queue import *

BLOCKSIZE = 65536;
DEBUG = 0;
MOVE_FILE = 0;

jobs = int(multiprocessing.cpu_count() * 3 / 4) - 1;
root_path = "/mnt/storage/personal";
src_roots = [ os.path.expanduser('~user1/move to storage'), '/mnt/storage', '/mnt/share' ];  # expanduser resolves ~user1; shell-style backslash escapes are not needed in python strings

timeit.template = """
def inner(_it, _timer{init}):
    {setup}
    _t0 = _timer()
    for _i in _it:
        retval = {stmt}
    _t1 = _timer()
    return _t1 - _t0, retval
"""

def wrapper(func, *args, **kwargs):
        def wrapped():
                return func(*args, **kwargs);

        return wrapped;

class cat_set(NamedTuple):
        name: str
        extensions: list

class worker(threading.Thread):
        def __init__(self, tid, cb, source, sink, cat):
                threading.Thread.__init__(self);
                self.tid = tid;
                self.source = source;
                self.sink = sink;
                self.cb = cb;
                self.ret_val = 0;
                self.category = cat;

        def run(self):
                if DEBUG > 0:
                        print("thread " + str(self.tid) + ": working");

                self.ret_val = thread_task(self.tid, self.cb, self.source, self.sink, self.category);

                if DEBUG > 0:
                        print("thread " + str(self.tid) + ": work is done.");

                return self.ret_val;

def sha1_file(file):
        hasher = hashlib.sha1();
        f =  open(file, 'rb');
        buf = f.read(BLOCKSIZE);

        while len(buf) > 0:
                hasher.update(buf);
                buf = f.read(BLOCKSIZE);

        return hasher.hexdigest();

def handle_scanned_file_cb(idx, existing_files_pool, filename, none):
        conflict = 0;
        hash = str(sha1_file(filename));
        if DEBUG > 0:
                print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ")");

        if hash in existing_files_pool:
                print("thread " + str(idx) + ": " + str(filename) + " hash (" + hash + ") is already in list (" + existing_files_pool[hash] + ")");
                conflict = 1;
        else:
                existing_files_pool[hash] = filename;

        return conflict;

def handle_sync_file_cb(idx, existing_files_pool, filename, category):
        hash = str(sha1_file(filename));
        if DEBUG > 0:
                print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ")");

        if not hash in existing_files_pool:
                existing_files_pool[hash] = filename;
                dst = root_path + "/" + str(category) + filename[filename.index("/", filename.index("/", 1) + 1):]

                if DEBUG > 0:
                        print("thread " + str(idx) + ": copying " + str(filename) + " to " + str(dst));

                if not os.path.exists(os.path.dirname(dst)):
                        try:
                                os.makedirs(os.path.dirname(dst));
                        except FileExistsError:
                                print(str(os.path.dirname(dst)) + " exists already, skipping.");

                copyfile(filename, dst);

                if MOVE_FILE:
                        os.remove(filename);
                        dir = os.path.dirname(filename);
                        if not os.listdir(dir):
                                os.rmdir(dir);
        else:
                if str(os.path.basename(filename)) == str(os.path.basename(existing_files_pool[hash])) and MOVE_FILE:
                        os.remove(filename);
                        dir = os.path.dirname(filename);
                        if not os.listdir(dir):
                                os.rmdir(dir);
                else:
                        print("thread " + str(idx) + ": got " + str(filename) + " hash (" + hash + ") which exists in the list as " + str(existing_files_pool[hash]) +", ignoring file.");

        return 0;

def handle_files(idx, cb, source, sink, category):
        conflicts = 0;
        processed_items_count = 0;
        start = time.time();
        process = True;

        while process:
                if DEBUG > 1:
                        print("thread " + str(idx) + ": count = " + str(source.qsize()));
                try:
                        element = source.get_nowait();
                        processed_items_count += 1;
                        if DEBUG > 1:
                                print("thread " + str(idx) + ": is handling");
                        conflicts += cb(idx, sink, element, category);

                except Empty:
                        elapsed = time.time() - start
                        if elapsed > 5 and not processed_items_count:
                                print("thread " + str(idx) + ": no file provided.");
                                process = False;
                        elif processed_items_count:
                                process = False;

        if conflicts:
                print("thread " + str(idx) + ": found " + str(conflicts) + " conflicts.");

        return conflicts;

def thread_task(idx, cb, source, sink, category):
        print("thread " + str(idx) + ": is up");

        status = handle_files(idx, cb, source, sink, category);

        print("thread " + str(idx) + ": is done");

        return status;

def handle_extensions(extensions):
        final_extensions = [];

        for ext in extensions:
                final_extensions.append("*." + ext.lower());
                final_extensions.append("*." + ext.upper());

        return final_extensions;

def get_files_from_subtree(root_folders, extensions):
        for folder in root_folders:
                for dirpath, dirnames, filenames in os.walk(folder):
                        for fn in filenames:
                                if any(fnmatch.fnmatch(fn, w) for w in extensions):
                                        path = os.path.join(dirpath, fn)
                                        stat = os.lstat(os.path.normpath(path))  # lstat fails on some files without normpath
                                        yield stat.st_ctime, path  # Yield file

def sync_src_tree_to_dest(idx, category, extensions, files_queue):
        act_paths = [ ];

        for path in src_roots:
                act_paths.append(os.path.abspath(path));

        for ctime, path in sorted(get_files_from_subtree(act_paths, extensions), reverse=False):
                if not path.startswith(root_path):
                        files_queue.put(path);

def scan_dst_tree(idx, category, extensions, files_queue):
        root_folder = root_path + "/" + category;

        for ctime, path in get_files_from_subtree([ root_folder ], extensions):
                if DEBUG > 1:
                        print("thread " + str(idx) + ": adding " + path + " to queue");
                files_queue.put(path);

def handle_data_action(idx, category, extensions, tree_scan_cb, handle_file_cb, action):
        final_extensions = handle_extensions(extensions);
        threads = set();
        files_queue = Queue();
        files_hash_map = dict();
        ret_val = 0;

        print("thread " + str(idx) + ": " + action + " " + category);
        for i in range(jobs):
                thread = worker(i + 1, handle_file_cb, files_queue, files_hash_map, category);
                threads.add(thread);
                thread.start();

        if DEBUG > 0:
                print("thread " + str(idx) + ": adding files");

        tree_scan_cb(idx, category, final_extensions, files_queue);

        if DEBUG > 1:
                print("thread " + str(idx) + ": count = " + str(files_queue.qsize()));
        if DEBUG > 0:
                print("thread " + str(idx) + ": waiting for queue to deplete");

        while True:
                if not len(threads):
                        break;
                thread = threads.pop();
                print("thread " + str(idx) + ": joining child thread " + str(thread.tid) + " until it finishes.");
                thread.join();
                print("thread " + str(idx) + ": child thread " + str(thread.tid) + " has finished. (ret val is " + str(thread.ret_val) + ")");
                if thread.ret_val:
                        ret_val = 1;

        print("thread " + str(idx) + ": done.");
        print("thread " + str(idx) + ": " + action + " finished.");

        return ret_val;

def handle_category(idx, category, extensions):
        wrapped = wrapper(handle_data_action, idx, category, extensions, scan_dst_tree, handle_scanned_file_cb, "scanning");
        duration, ret_val = timeit.timeit(wrapped, number = 1);
        print('thread {:d}: {:s} scan took {:.3f} seconds'.format(idx, category, duration));

        if not ret_val:
                wrapped = wrapper(handle_data_action, idx, category, extensions, sync_src_tree_to_dest, handle_sync_file_cb, "syncing");
                duration, ret_val = timeit.timeit(wrapped, number = 1);
                print('thread {:d}: {:s} sync took {:.3f} seconds'.format(idx, category, duration));

cats_set = [
        cat_set("videos", [ "mkv", "mov", "mp4", "avi" ]),
        cat_set("pictures", [ "jpg", "png", "gif", "jpeg" ]),
];

for cat in cats_set:
        handle_category(0, cat.name, cat.extensions);


it does what I need; the problem is the time it takes to create the lists, especially the step that calculates the sha1 of each file.
I'll appreciate any input on the code in general, and on how to speed up the info-gathering step.
I know I could maintain a persistent list, but I'd have to validate it each time I run the script, and that amounts to creating it from scratch.
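One way around that (just a sketch, not part of the script above; the cache file location and format are made up) would be to key a cache on path, size and mtime, so only new or changed files get re-hashed:
Code:
import json, os

CACHE_FN = "/tmp/hash_cache.json"  # hypothetical location

def load_cache():
        # returns {} on first run or if the cache is unreadable
        try:
                with open(CACHE_FN) as f:
                        return json.load(f)
        except (IOError, ValueError):
                return {}

def cached_sha1(cache, path):
        # re-hash only if path, size or mtime changed since the last run
        st = os.lstat(path)
        key = "%s|%d|%d" % (path, st.st_size, int(st.st_mtime))
        if key not in cache:
                cache[key] = sha1_file(path)  # sha1_file() from the script above
        return cache[key]

def save_cache(cache):
        with open(CACHE_FN, "w") as f:
                json.dump(cache, f)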

Hu (Moderator)
Posted: Sat Nov 10, 2018 6:15 pm

DaggyStyle wrote:
Code:
BLOCKSIZE = 65536;
No need for the trailing semicolon.
DaggyStyle wrote:
Code:
def wrapper(func, *args, **kwargs):
        def wrapped():
                return func(*args, **kwargs);

        return wrapped;
This body could be rewritten as:
Code:
   return lambda: func(*args, **kwargs)
DaggyStyle wrote:
Code:
def sha1_file(file):
Naming arguments after Python keywords is legal, but produces confusing syntax highlighting in some editors. Avoid it where practical.
DaggyStyle wrote:
Code:
        f =  open(file, 'rb');
        buf = f.read(BLOCKSIZE);

        while len(buf) > 0:
                hasher.update(buf);
                buf = f.read(BLOCKSIZE);
You should use with. It would be better not to repeat yourself, although Python lacks a good way of doing so. You could rewrite this as:
Code:
   with open(filename, 'rb') as f:
      while True:
         buf = f.read(BLOCKSIZE)
         if not buf:
            break
         hasher.update(buf)
You should be able to rewrite it in a way that combines the while condition and the truth test, but that is only possible if you have the very new := assignment or if you fake it by using a dummy iterator.
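With := (Python 3.8 or later), that reads:
Code:
   with open(filename, 'rb') as f:
      while buf := f.read(BLOCKSIZE):
         hasher.update(buf)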
DaggyStyle wrote:
Code:
        conflict = 0;
No need for a single-return function here.
DaggyStyle wrote:
Code:
        hash = str(sha1_file(filename));
sha1_file returns the result of hexdigest(), which is already a string. No need to str it again.
DaggyStyle wrote:
Code:
        if hash in existing_files_pool:
                print("thread " + str(idx) + ": " + str(filename) + " hash (" + hash + ") is already in list (" + existing_files_pool[hash] + ")");
                conflict = 1;
You could read the value back as part of the lookup. Rewrite as:
Code:
   h = existing_files_pool.get(hash)
   if h is not None:
      print("thread %s: %s hash (%s) is already in list (%s)" % (idx, filename, hash, h))
      return 1
DaggyStyle wrote:
Code:
def sync_src_tree_to_dest(idx, category, extensions, files_queue):
        act_paths = [ ];

        for path in src_roots:
                act_paths.append(os.path.abspath(path));
Code:
   act_paths = [os.path.abspath(path) for path in src_roots]
DaggyStyle wrote:
Code:
        while True:
                if not len(threads):
                        break;
Code:
while threads:
DaggyStyle wrote:
Code:
cats_set = [
        cat_set("videos", [ "mkv", "mov", "mp4", "avi" ]),
        cat_set("pictures", [ "jpg", "png", "gif", "jpeg" ]),
];
This may return pictures that are not of cats.
DaggyStyle wrote:
it does what I need; the problem is the time it takes to create the lists, especially the step that calculates the sha1 of each file.
I'll appreciate any input on the code in general, and on how to speed up the info-gathering step.
I know I could maintain a persistent list, but I'd have to validate it each time I run the script, and that amounts to creating it from scratch.
As a first step, check that you never digest the same file more than once. Your multi-level callback design makes it hard to say quickly whether that could happen. If it did, fixing it would be an easy performance win.

DaggyStyle (Watchman)
Posted: Fri Nov 16, 2018 3:01 pm

Hu wrote:
[...]
As a first step, check that you never digest the same file more than once. Your multi-level callback design makes it hard to say quickly whether that could happen. If it did, fixing it would be an easy performance win.


the logic is rather simple:

  1. scan the target folder for the relevant files and digest each of them.
  2. scan the source folders; for each relevant file, if its digest doesn't exist in the target folder's digest list, move it to the target folder and add it to the digest list.

the digestion takes time; maybe the dict is the issue, if it isn't hashed or something.

Hu (Moderator)
Posted: Fri Nov 16, 2018 11:12 pm

I understood the intent to be as you describe. My point was that I can't tell from the code whether it actually works like that unless I trace through multiple levels of callback. If it doesn't work like you say it should, that could easily cost substantial performance.

Python dictionaries are optimized, because they are used extensively. Even if they were not, the performance cost of the most naive implementation of a dictionary would likely be tiny compared to the performance cost of hashing large numbers of large files.

DaggyStyle (Watchman)
Posted: Sat Nov 17, 2018 6:21 am

Hu wrote:
I understood the intent to be as you describe. My point was that I can't tell from the code whether it actually works like that unless I trace through multiple levels of callback. If it doesn't work like you say it should, that could easily cost substantial performance.

Python dictionaries are optimized, because they are used extensively. Even if they were not, the performance cost of the most naive implementation of a dictionary would likely be tiny compared to the performance cost of hashing large numbers of large files.

OK, so to better pinpoint the issue, I should remove the callbacks?

Page 1 of 2