Gentoo Forums
lseek(..., SEEK_HOLE) giving ENOENT
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Sun Nov 19, 2023 10:32 am    Post subject: lseek(..., SEEK_HOLE) giving ENOENT

I have chromium installed. Lately, just for this package, I've been setting PORTAGE_TMPDIR to a location with plenty of space (I normally keep it in /tmp, but there just isn't enough RAM for this gargantuan package). Now it's failing at the install stage with the following output:

Code:
Failed to copy file: _parsed_options=Namespace(group=-1, owner=-1, mode=420, preserve_timestamps=False), source=b'/mnt/bigdata/portage/www-client/chromium-119.0.6045.123/temp/README.gentoo', dest_dir=b'/mnt/bigdata/portage/www-client/chromium-119.0.6045.123/image/usr/share/doc/chromium-119.0.6045.123/.'
Traceback (most recent call last):
  File "/usr/lib/portage/python3.11/doins.py", line 197, in run
    copyfile(source, dest)
  File "/usr/lib/python3.11/site-packages/portage/util/file_copy/__init__.py", line 29, in _optimized_copyfile
    _file_copy(src_file.fileno(), dst_file.fileno())
FileNotFoundError: [Errno 2] No such file or directory

There doesn't seem to be anything wrong with README.gentoo, nor with the filesystem as far as I can determine. The file exists, but even if I repeat the emerge with FEATURES=keepwork, this still happens. Strangely enough, I've been able to get the package installed with "ebuild ... install" and "ebuild ... qmerge", so there seems to be something with the environment emerge sets up that triggers this.

I've tracked down where the error is being generated to the function do_lseek_data() in src/portage_util_file_copy_reflink_linux.c in portage-3.0.55, line 169:

Code:
    offset_hole = lseek(fd_in, offset_data, SEEK_HOLE);

For some reason, just for this one file, this call is failing and setting errno = ENOENT. lseek() is never supposed to generate that error, right? Maybe a bug in glibc or the kernel? I tried to debug the kernel in a virtual machine, but the failure didn't happen. Any suggestions?
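
For reference, lseek(2) documents EBADF, EINVAL, ENXIO, EOVERFLOW and ESPIPE as its errors; ENOENT is not among them. The same call can be probed from Python without going through Portage. This is only a minimal sketch: the helper name find_first_hole is made up for illustration, and it assumes a Linux system where os.SEEK_HOLE is available.

```python
import errno
import os


def find_first_hole(path):
    """Return (offset, None) on success, or (None, errno_name) on failure.

    Hypothetical helper: it issues the same lseek(fd, 0, SEEK_HOLE) call
    that fails in the traceback above.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_HOLE), None
    except OSError as exc:
        # Translate the numeric errno into its symbolic name, e.g. "ENOENT".
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

For a non-empty file without holes, the returned offset should equal the file size, because the implicit hole at end of file is the first one found.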
sam_
Developer
Joined: 14 Aug 2020
Posts: 1678

Posted: Mon Nov 20, 2023 10:30 am

What filesystem is PORTAGE_TMPDIR on in this instance?
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Mon Nov 20, 2023 11:10 am

Ext4. And if it matters, features are has_journal ext_attr dir_index filetype meta_bg extent 64bit flex_bg inline_data sparse_super large_file huge_file dir_nlink extra_isize metadata_csum

In the virtual machine I mentioned, I also used ext4, though with the default features. I might try again with matching features and a kernel built from the same config.
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Wed Nov 22, 2023 6:17 pm

I've made progress in debugging this. First of all, it's happening with other packages, e.g. dev-qt/qtgui. After debugging the kernel and filesystem for a while, I found the following.

  • This only happens when the inline_data feature is enabled, because the affected kernel code checks for the filesystem feature first, before checking whether the inode actually holds inline data.
  • The problem file for dev-qt/qtgui is temp/qtgui-qconfig.h, which is 207 bytes, too large to be inline. The on-disk inode is good. It has i_flags == 0x80000 (EXT4_EXTENTS_FL), as should be expected.
  • However, for some reason, the in-memory flags (in struct ext4_inode_info) appear to be wrong. I found i_flags == 0x2210000000, which includes EXT4_INLINE_DATA_FL (0x10000000). The kernel code determines the size of the inline data (in this case, 111 bytes) to be less than the file size, tries to find more, can't, and returns -ENOENT.
  • I haven't yet determined why i_flags has a bad value.
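
The two flag values above can be compared bit by bit. The sketch below decodes only the two on-disk flag bits named in this post (EXT4_EXTENTS_FL = 0x80000 and EXT4_INLINE_DATA_FL = 0x10000000). The function name is made up, and the bits above the low 32 of the in-memory value are deliberately left uninterpreted, on the assumption that they are not on-disk flags (the on-disk i_flags field is 32 bits wide).

```python
# On-disk ext4 inode flag bits discussed in this thread.
EXT4_EXTENTS_FL = 0x00080000
EXT4_INLINE_DATA_FL = 0x10000000

KNOWN_FLAGS = {
    "EXT4_EXTENTS_FL": EXT4_EXTENTS_FL,
    "EXT4_INLINE_DATA_FL": EXT4_INLINE_DATA_FL,
}


def decode_disk_flags(i_flags):
    """List the known flag names set in the low 32 bits of i_flags.

    Hypothetical helper; any bits above bit 31 are ignored.
    """
    low = i_flags & 0xFFFFFFFF
    return [name for name, bit in KNOWN_FLAGS.items() if low & bit]


print(decode_disk_flags(0x80000))        # on-disk value
print(decode_disk_flags(0x2210000000))   # in-memory value
```

The on-disk value decodes to EXT4_EXTENTS_FL only, while the in-memory value decodes to EXT4_INLINE_DATA_FL only, so the extents bit is missing in memory and the inline-data bit is set instead.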
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21651

Posted: Wed Nov 22, 2023 6:29 pm

With what kernel version(s) have you observed this? Have you found a simpler reproducer than running emerge affected-package? If this is only observed with emerge (so far), does disabling the Portage sandbox have any effect? As a user library, the sandbox should not be causing this, but it could be causing the relevant programs to use different system calls than the non-sandbox path, allowing them to hit a kernel bug that is not seen when using the non-sandbox path.

If you mount a tmpfs on PORTAGE_TMPDIR, does the emerge then succeed? I would expect so since this appears to be in the ext4 fs code, but if not, that would suggest a more general VFS problem and that ext4 is just where the error finally becomes noticeable.
grknight
Retired Dev
Joined: 20 Feb 2015
Posts: 1666

Posted: Wed Nov 22, 2023 6:51 pm

Fun quote from Ted Ts'o at https://bugzilla.kernel.org/show_bug.cgi?id=200681#c5 :
Quote:
There are in fact a number of ways that inline_data is broken for files --- in particular when a file transitions from a small file that fits inside the inode, to a large file. However, when I looked at your reproducer, it did *not* appear to be one of the cases which I would have expected to be problematic.

So when I tried it in my KVM framework, I was not surprised when it worked for me. That being said, as I've said, there are definitely problems with inline_data --- there is a reason why it is not enabled by default!

(For example, generic/477 is broken for the inline case, which can be easily seen by running "kvm-xfstests -c inline generic/477". For another, there are some cases where if you crash immediately after a inline->regular conversion, the file contents might not be correct.)

This is from 2018 so I don't know if things have improved or not.
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Wed Nov 22, 2023 7:54 pm

Hu wrote:
With what kernel version(s) have you observed this?

6.6.1-gentoo (all custom builds, BTW, though the latest ones I'm using in the VM come from "make defconfig" followed by a few tweaks)

Quote:
Have you found a simpler reproducer than running emerge affected-package?

Not yet, but I'm experimenting with separately creating small files and running the lseek() call in question, now that I have a handle on the issue.

Quote:
but it could be causing the relevant programs to use different system calls than the non-sandbox path, allowing them to hit a kernel bug that is not seen when using the non-sandbox path.

The fact that using ebuild directly is getting around this does support that.

Quote:
If you mount a tmpfs on PORTAGE_TMPDIR, does the emerge then succeed?

As I said at the start, I normally have it on /tmp, which is indeed a tmpfs, and I've already run emerge there with thousands of builds, and even a couple with chromium, so I consider that check already done.
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Wed Nov 22, 2023 11:21 pm

Okay, I think I have a simple way to reproduce this.
grknight wrote:
Fun quote from Ted Ts'o at https://bugzilla.kernel.org/show_bug.cgi?id=200681#c5 :

Thank you for this. It was very helpful for being able to reliably reproduce the bug.
Quote:
but it could be causing the relevant programs to use different system calls than the non-sandbox path, allowing them to hit a kernel bug that is not seen when using the non-sandbox path.

Turns out turning off sandboxing makes no difference.

The real way to trigger this is as follows:

  1. The filesystem must be ext4 with inline_data, of course.
  2. Create a new file and write no more than the maximum inline size to it, so that its data is stored inline.
  3. In one or more separate calls, write more data to the file, such that it must become non-inline.
  4. Before the file is synced, do lseek(..., SEEK_HOLE) on it.

Just to be clear, reading from the file at step 4 seems to work just fine.

Another thing that seems to matter slightly is the size of the filesystem. I find that if it's just a couple of MB, the bug may not be reproducible. Just something to keep in mind when testing. I found a 32 MiB volume reproduces it reliably.

And here is a small program to demonstrate.
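
The four steps above can be sketched in Python as follows. This is an illustration of the steps, not necessarily the program referred to here. The function name is made up, and the three 64-byte writes are an assumption about what counts as "small enough to start inline, then too big", chosen to match the dd command shown later in the thread.

```python
import os


def seek_hole_after_growth(path, chunk=64, writes=3):
    """Grow a fresh file in several small writes, then SEEK_HOLE it unsynced.

    Returns (offset, None) on success or (None, errno) on failure.
    Hypothetical helper; chunk and writes are assumed values.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Steps 2 and 3: the first write is small, later writes grow the file.
        for _ in range(writes):
            os.write(fd, b"\0" * chunk)
        # Step 4: no fsync/fdatasync before looking for the first hole.
        try:
            return os.lseek(fd, 0, os.SEEK_HOLE), None
        except OSError as exc:
            return None, exc.errno
    finally:
        os.close(fd)
```

On a filesystem that is not affected, the call should succeed and return the file size (192 with the defaults). On an affected ext4 volume with inline_data, the second element would be expected to be errno.ENOENT.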
ddawson
n00b
Joined: 24 Jul 2018
Posts: 21
Location: United States

Posted: Fri Nov 24, 2023 7:38 pm

One can also see the bad flags just by running this command line:
Code:
$ rm -f test-file; dd if=/dev/zero of=test-file bs=64 count=3 status=none; lsattr test-file

Output should be
Code:
--------------e------- test-file

but is actually
Code:
------------------N--- test-file

until the file is synced.
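
In lsattr output, 'e' marks an inode that uses extents and 'N' marks one that stores its data inline, so the two lines above say the same thing as the i_flags values found earlier. A small sketch that names the letters in an lsattr line; the helper and its table are made up for illustration and cover only these two letters.

```python
# Only the two attribute letters that matter in this thread.
LSATTR_LETTERS = {"e": "extents", "N": "inline data"}


def lsattr_flags(line):
    """Return the set of attributes shown in one line of lsattr output.

    Hypothetical helper; letters not in the table are returned unchanged.
    """
    field = line.split()[0]  # first column is the attribute string
    return {LSATTR_LETTERS.get(ch, ch) for ch in field if ch != "-"}


print(lsattr_flags("--------------e------- test-file"))
print(lsattr_flags("------------------N--- test-file"))
```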

A workaround is to sync the file before trying to map its holes. I find the following effective for portage:
Code:
diff -ur portage-3.0.55/bin/doins.py portage-3.0.55-patched/bin/doins.py
--- portage-3.0.55/bin/doins.py 2023-11-06 13:39:48.000000000 -0800
+++ portage-3.0.55-patched/bin/doins.py 2023-11-24 11:20:19.453537283 -0800
@@ -194,6 +194,9 @@
             if e.errno != errno.ENOENT:
                 raise
         try:
+            # Work around stale inline-data flag in ext4
+            with open(source, "r") as fd:
+                os.fdatasync(fd)
             copyfile(source, dest)
             _set_attributes(self._parsed_options, dest)
             if self._copy_xattr:

Next up is to see about getting the kernel fixed.
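
The same sync-before-copy idea can be tried outside Portage. In this minimal sketch, shutil.copyfile merely stands in for Portage's own copy routine (an assumption made for illustration), and the function name is made up.

```python
import os
import shutil


def copy_after_sync(source, dest):
    """Flush the source file's data to disk, then copy it.

    Hypothetical helper mirroring the patch above; shutil.copyfile is a
    stand-in for Portage's copy routine, not the routine itself.
    """
    fd = os.open(source, os.O_RDONLY)
    try:
        # Sync first, so the file is no longer in the unsynced state
        # described above by the time it is copied.
        os.fdatasync(fd)
    finally:
        os.close(fd)
    shutil.copyfile(source, dest)
```

The sync costs one extra flush per copied file, which is why this is a workaround and not a fix.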
Spacey
n00b
Joined: 07 Apr 2024
Posts: 1

Posted: Sun Apr 07, 2024 9:11 pm

For what it is worth, I am running into this too. Many months ago I couldn't figure this out, so I just started excluding that one package from updates. But then a 2nd package came along. And then after a profile update, I wanted to emerge world and found a 3rd affected package. hddtemp, qemu, and edk2-ovmf-bin are the ones I ran into.

They all broke with an Errno 2 on copying README.gentoo ... just like yours. I tested on 6.1.12 and 6.6.21 and both had the same issue. I did an strace, and ended up deciding this was the point where things broke:
Code:
lseek(3, 0, SEEK_HOLE)            = -1 ENOENT (No such file or directory)

I have two very similar systems here, but one is NAS focused and one is router focused, and only the NAS one has this issue. The NAS one tends to have more filesystems and features enabled. For a long time I couldn't figure out what was different that could be breaking this, but then I found this thread. Sure enough, my NAS system has inline_data enabled on the ext4 root filesystem but the router doesn't. Bingo!
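
To compare two systems this way, it is enough to look at the "Filesystem features:" line printed by `tune2fs -l` or `dumpe2fs -h` for the device in question. A small sketch that checks such output for a feature; the function name is made up, and the sample text reuses the feature list posted earlier in the thread.

```python
def has_feature(superblock_text, feature):
    """Return True if the feature appears on the 'Filesystem features:' line.

    Hypothetical helper; superblock_text is assumed to be the text output
    of `tune2fs -l <device>` or `dumpe2fs -h <device>`.
    """
    for line in superblock_text.splitlines():
        if line.startswith("Filesystem features:"):
            return feature in line.split(":", 1)[1].split()
    return False


sample = (
    "Filesystem features:      has_journal ext_attr dir_index filetype "
    "meta_bg extent 64bit flex_bg inline_data sparse_super large_file "
    "huge_file dir_nlink extra_isize metadata_csum\n"
)
print(has_feature(sample, "inline_data"))
```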

As a hack to work around this, I tried:
Code:
mount -t tmpfs -o mode=1777 portagetmp /var/tmp/portage/

and sure enough, I no longer have problems emerging these packages.