Anacron - Daily and Weekly job conflicts

Still need help with Gentoo, and your question doesn't fit in the above forums? Here is your last bastion of hope.
14 posts • Page 1 of 1
Post by lyallp » Mon Feb 28, 2022 1:38 am

I use anacron with /etc/cron.daily and /etc/cron.weekly directories populated with scripts that I execute.

My problem is that one of my weekly scripts is a long-running disk test (I have bad sectors on one of my disks and I want to keep an eye on the disk).

This conflicts with a daily job (emerge --sync): the long-running script generates errors as files in portage that were there when the weekly script started go 'missing'.

This could easily be a weekly backup rather than a disk check; it's the concept that matters.

Can I set things up so that anacron runs all its daily/weekly/monthly scripts in series, or make my long-running script prevent shorter-running scripts from starting until it is finished?

Is there some 'lockfile' my jobs can create to prevent other jobs from running?

I guess I could replace run-parts with my own script which scans for executable files in the indicated directory and runs them in series, but I felt this was something someone else might have already done.

I would create files in /etc/cron.daily, for example, as '01-script.parallel' and '01-script.serial', execute all the parallel ones first and, when they are complete, iterate through the serial ones...

I could have my script 'lock out' any other instances of itself, so that when anacron launches the daily, weekly and monthly jobs at around the same time (the command-line delay being a hack), the invocations would wait for each other and execute in series...

Alternatively is there a different package that does what I am looking for?

Thoughts much appreciated.
...Lyall

Post by Marlo » Mon Feb 28, 2022 8:03 am

Hello lyallp,
You can play with RANDOM_DELAY and START_HOURS_RANGE. See the man page
https://linux.die.net/man/5/anacrontab.
Greetings
Ma
------------------------------------------------------------------
http://radio.garden/

Post by lyallp » Mon Feb 28, 2022 9:45 am

Random delay doesn't cut the mustard.

I have written a simplistic Run-Parts.sh which searches a directory for scripts with 'serial' in the filename and executes them one after the other, waiting until one finishes before starting the next.

It will run scripts with 'parallel' in the name, in parallel.

The remaining scripts are executed in parallel, basically current functionality of run-parts, as I understand it.

The script only allows one instance of itself to run at a time, waiting until other instances finish before resuming; thus daily, weekly and monthly scripts won't interfere with each other.

I simply replaced the anacron invocations of run-parts with my Run-Parts.sh.

A lot of missing functionality compared to run-parts but good enough for my purposes.
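A minimal sketch of what such a replacement might look like (hypothetical: the function name, lock-file location and the 'serial'/'parallel' filename convention are assumptions based on the description above, not the actual Run-Parts.sh):

Code: Select all

```shell
#!/bin/bash
# Hypothetical run-parts replacement: runs scripts with 'serial' in the
# name one at a time, everything else in parallel, and allows only one
# instance of itself at a time via flock.
run_parts_serial() {
    local dir="$1" f

    # Single instance: block here until any other invocation
    # (daily/weekly/monthly) has released the shared lock file.
    exec 9>"${LOCKFILE:-/tmp/my-run-parts.lock}"
    flock 9

    # Scripts with 'serial' in the name run one after another.
    for f in "$dir"/*serial*; do
        [ -x "$f" ] && "$f"
    done

    # Everything else runs in parallel; wait for all of them to finish.
    for f in "$dir"/*; do
        case "$f" in *serial*) continue ;; esac
        [ -x "$f" ] && "$f" &
    done
    wait
}
```

The lock is released automatically when the function's shell exits, so a monthly invocation started while the daily is still running simply blocks at the flock call.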

For me, this works well with a weekly script that I have that takes many hours but has non-functional conflicts with some of my daily scripts.

It completely removes the need for the random delay or delay before weekly, monthly steps.

For example, each month I run a disk check, which takes ages, whilst daily I do an emerge --sync; the appearance and disappearance of files during the disk check generates diagnostics.
...Lyall

Post by szatox » Mon Feb 28, 2022 9:52 am

lyallp wrote:
    Is there some 'lockfile' my jobs can create to prevent other jobs from running?

flock.
You can use it to invoke a command after acquiring an exclusive (write) lock on a file, or you can open a file inside a script and then flock it.
Either way, the lock is inherited by all children and is automatically removed when all processes spawned under flock terminate.

If the lock can't be acquired within a user-defined timeout, flock fails without starting the command provided.
Bonus: parallel tasks can use non-exclusive (read) locks. They don't block each other, but do prevent write locks from serial tasks.
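A quick sketch of both modes from the shell (the lock-file path is just an example):

Code: Select all

```shell
#!/bin/bash
# Exclusive (write) lock: only one holder at a time.
# -w 60 gives up after 60 seconds instead of waiting forever.
flock -x -w 60 /tmp/jobs.lock -c 'echo "serial job runs alone"'

# Shared (read) locks: any number may coexist, but an exclusive
# lock must wait until all shared holders have released.
flock -s /tmp/jobs.lock -c 'echo "parallel job 1"' &
flock -s /tmp/jobs.lock -c 'echo "parallel job 2"' &
wait
```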

Post by lyallp » Mon Feb 28, 2022 10:01 am

Cool, will look into it... :)

Edit: Well, I did look into it, flock doesn't quite fit my requirements, still useful in the script, however.
...Lyall

Post by Hu » Mon Feb 28, 2022 7:20 pm

Why is your disk check causing files to go in and out of existence? A healthy disk should never lose files, and the disk check should not be modifying filesystems on the device.

Post by szatox » Mon Feb 28, 2022 8:49 pm

Hu, I suppose he unmounts the FS before fsck.

lyallp, what's wrong with flock? Answering this question could inform other possible answers.

Post by Hu » Mon Feb 28, 2022 10:03 pm

Perhaps so. I wouldn't use fsck in an automated manner on a drive known to have failing sectors. I would use a SMART test, if I kept the drive in service at all. SMART is nondisruptive and can be done while the filesystem is mounted.

Post by lyallp » Mon Feb 28, 2022 10:27 pm

You asked for background .... :)

My long-running job is a disk check; it does not use fsck.

The disk that is failing is spinning iron, and the number of hard errors is not unbearable at this time.
It's a 6TB drive, split into half NTFS filesystem and half XFS.

The XFS part has data which is not crucial, such as portage and other dynamic data, plus I have backups.

What I do is find every file on the disk, cat/dd it to /dev/null and, if there is an error, move it to a folder called BAD_SECTORS.
New errors do not happen often, but I wrote the script when I first found I had bad sectors on the disk via syslog and had great difficulty associating a kernel-level bad sector with a human-level file.
So, I had a bad sector; what file was affected? Bugger it, I will read the contents of all files to find the affected files or, in some cases, directories.

I don't bother trying to recover the file or directory which has the error; I just put it to one side so it does not interfere again with things like portage.
Sectors that are broken but not contained within a file or the directory structure are not found by this test. I did consider using 'dd' to fill the disk with nulls to try to find these, but my care factor wandered away.

I also emerge --sync every day, so the emerge --sync and the disk check conflict.

One of these days, I will replace the disk, it's not urgent.

So, back to anacron.

I use anacron to schedule daily/weekly/monthly jobs; one daily job is emerge --sync and a weekly job is the bad sector scan on the failing drive.

I was having reports of missing files that the bad sector scan could not find, as a result of emerge --sync deleting files...

The problem, for me, is that flock seemed to assume I wanted to run a command once the lock was acquired and release the lock when the command completed.

I wanted one command to acquire the lock and a separate command to release it, whilst I did stuff in between.

I perfectly understand putting 'flock' in each script run by anacron from /etc/cron.daily, for example, making the parallel-or-serial decision in each script.

I wanted to have the lock at a higher level, at the actual execution of the whole directory of /etc/cron.daily, for example.

The running script would prevent other instances of itself from running (say, the monthly and weekly), forcing them to wait until the daily has completed.

I guess I could put 'flock' at the /etc/anacrontab level where it runs the run-parts script; just make run-parts a parameter of 'flock'.
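Wrapping run-parts in flock at the anacrontab level might look like this (the periods, delays and lock-file path are illustrative, not the actual configuration):

Code: Select all

```
# Sketch of /etc/anacrontab with all three run-parts invocations
# serialised on one shared lock file:
1        5   cron.daily    flock /var/lock/anacron-jobs.lock run-parts /etc/cron.daily
7        10  cron.weekly   flock /var/lock/anacron-jobs.lock run-parts /etc/cron.weekly
@monthly 15  cron.monthly  flock /var/lock/anacron-jobs.lock run-parts /etc/cron.monthly
```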

Then again, I wanted individual scripts in the daily to run serially or in parallel; for example, I have a weekly backup which takes ages.

I do not want other jobs to run whilst this backup is happening, for example.

However, I do have other jobs that I am happy to run in parallel, clean up scripts, etc.

It just seemed easier to me to write my own run-parts script.

To that end, flock is useful for executing the individual scripts from this run-parts script, but it is not so useful in serialising daily, weekly and monthly runs; although, having thought about it, using flock at the /etc/anacrontab level achieves some of the result I am looking for.

I do now use flock to serialise steps within my run-parts script, and I may even use it at the /etc/anacrontab level, but filenames that indicate parallel or serial running seem more intuitive to me than a flock call inside each script, which would require examining each script to determine whether it runs in parallel or serially.

Anyway, that's some background on the issue, and my thoughts on flock when it comes to parallelising/serialising anacron jobs.
...Lyall

Post by figueroa » Tue Mar 01, 2022 5:37 am

Would you be kind enough to share your script that does "find every file on the disk, cat/dd it to null and if there is an error, move it to a folder called BAD_SECTORS"?
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi -wayland

Post by lyallp » Tue Mar 01, 2022 7:21 am

Sure, it's nothing fancy....

The script finds the files; I move them manually after looking at what each file is and whether I need to do anything.

Code: Select all

#!/bin/bash
#
## #######################################################################################
## Cat ALL files from each of the directories listed below to /dev/null,
## to look for bad sectors.
## IGNORE the directory BAD_SECTORS and all its contents.
## Also, this script will not report bad sectors that are in the actual directory
## structure, as opposed to files that contain bad sectors.
## #######################################################################################
#
for dirToCheck in / /home /var /tmp /mnt/c_drive /mnt/e_drive /data
do
    if [ ! -d "${dirToCheck}" ]
    then
        echo "$0: Expect >${dirToCheck}< to be a directory that exists."
        exit 1
    fi

    # Braces rather than parentheses here, so the exit terminates the
    # script instead of just a subshell.
    cd "${dirToCheck}" || { echo "$0: Failed to cd to >${dirToCheck}<"; exit 2; }

    echo "Checking ${dirToCheck} for bad sectors."
    # Send error output of find to /dev/null, as it does traverse the
    # BAD_SECTORS folder (if any) during its search.
    # Ignore those errors.
    #
    find . -mount -type f 2> /dev/null |
        grep -v 'BAD_SECTORS' |
        while IFS= read -r f
        do
            cat "${f}" > /dev/null || echo "${f} has problems"
        done
done
exit 0
...Lyall

Post by Hu » Tue Mar 01, 2022 4:40 pm

Rather than grep -v for BAD_SECTORS, you could use a find expression that prohibits examining that directory at all.

Code: Select all

find . -mount -name BAD_SECTORS -prune -o -type f -print0 | while IFS= read -r -d '' f; do
This lets you avoid discarding the output of find, which might be desirable, since as-is, you will not see it report broken directories outside the BAD_SECTORS area.

I note that your cat test can produce confusing output. If you have two files with the same paths under their respective mount points, you cannot tell from the output which one the script is reporting.

Post by szatox » Tue Mar 01, 2022 6:08 pm

lyallp wrote:
    The problem, for me, is flock seemed to assume that I wanted to run a command when the lock was achieved and release the lock when the command completed.
    I wanted a lock which acquired a lock and a separate? command to release the lock, whilst I did stuff in between.

Well, the easy way (running the script under flock) does acquire the lock, then run the script.
If you want to release the lock early, you have to do it the other way around: run the script directly, then have it open a file, flock it, execute whatever needs to be serialised, and finally close the file. Check the manual for syntax (EXAMPLES section).

You can't make the lock linger after terminating the script. However, you can lock the file, read it, decide whether or not you want the script to continue, and update to inform future invocations before releasing the lock.
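A sketch of that pattern (based on the flock(1) EXAMPLES section; the descriptor number and file name are just examples):

Code: Select all

```shell
#!/bin/bash
# Hold a lock for only part of a script: open the lock file on a
# high-numbered descriptor, flock it, do the serialised work, then
# release early so the rest of the script runs unlocked.
exec 200>/tmp/serialise.lock   # open the lock file on fd 200
flock 200                      # block until we hold the exclusive lock
echo "critical section: one instance at a time"
flock -u 200                   # release early
echo "unlocked work continues"
```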
lyallp wrote:
    I do not want other jobs to run whilst this backup is happening, for example.
    However, I do have other jobs that I am happy to run in parallel, clean up scripts, etc.

Write-lock a marker file before running the backup, and read-lock the same file before running other tasks. Read locks are social and will happily share that file; a write lock goes all in or all out.
lyallp wrote:
    The running script, would prevent other instances of itself from running (say monthly and weekly), forcing them to wait till daily has completed.

Well, you can do read locks on one file to allow parallel execution of several tasks when the backup (with its write lock) is not running, and a write lock on another file to prevent running multiple instances at the same time. But it makes me think you're overengineering it. Throw in another layer to ensure the daily won't run at the same time as the weekly, and you're definitely overengineering it.
What is the actual thing that bothers you here? Do those scripts conflict with each other? Do they consume too many resources? Maybe you should probe the system for its status instead?
Maybe batch is actually a better option? (E.g. running a task when the load average is below a predefined value.)

Finally, you really should just replace that big failing disk.

Post by lyallp » Tue Mar 01, 2022 10:10 pm

Script: I print the mount point I am scanning before the find, so that distinguishes the files.

The error output of find is discarded in the script above; however, point taken on the find expression. I do have BAD_SECTORS_# directories, where # is 0-6 so far...

Yes, I should replace the disk but it only has about 12 bad sectors so far and the number has not changed in quite some time.

Locking, upon reflection, thanks for that feedback. I do use flock now :)
...Lyall