Gentoo Forums
Pattern match and regexp match can't handle this character?
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Tue Feb 19, 2019 3:55 pm    Post subject: Pattern match and regexp match can't handle this character?

I stumbled upon this zip (a Doom2 wad).

I've been using app-arch/unzip to unpack the archive.

The file names clearly end with '.txt' and '.wad'. But find won't match them with -name/-iname. Also trying find's -regex/-iregex won't help (I know you need to match the file path entirely).

Then I started to investigate. I ran find and then tried to grep its output to see how it behaves...
The character doesn't even match the '.' (dot) regular expression. I even tried something like this:
Code:
find . -regextype egrep -iregex '([.]|[^.])*'
... which tells me that the character kinda breaks pattern matching and regular expression matching too.

But at the "last moment" I decided to try setting LC_ALL="C", and it works.
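The workaround can be reproduced with a sketch like the one below. It assumes a Linux filesystem that accepts arbitrary bytes in file names; the name `ren\351.wad` is made up for illustration and is not taken from the zip. What find does with the same name in a UTF-8 locale depends on the find and libc versions, so the sketch only shows the C-locale side.

```shell
# Scratch directory with one file whose name contains a lone 0xE9 byte
# (Latin-1 "e acute"), which is not a valid UTF-8 sequence on its own.
tmp=$(mktemp -d)
name=$(printf 'ren\351.wad')   # hypothetical file name
: > "$tmp/$name"

# In the C locale each byte is treated as a single character,
# so '*' can span the stray byte and the pattern matches.
found=$(LC_ALL=C find "$tmp" -type f -name '*.wad' | wc -l | tr -d ' ')
echo "matches in C locale: $found"

rm -rf "$tmp"
```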

Somebody tell me what's going on here?
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Tue Feb 19, 2019 4:08 pm

I can't download the file. Maybe it contains an illegal UTF-8 character in a filename?
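One way to check that guess without downloading anything special is to feed each path through iconv and see whether it decodes as UTF-8. This is only a sketch: it assumes iconv is installed, the function name `list_invalid_utf8_names` is made up, and names containing newlines are not handled.

```shell
# Print every path under directory $1 whose bytes are not valid UTF-8.
# iconv exits non-zero when its input cannot be decoded as UTF-8.
list_invalid_utf8_names() {
    find "$1" -print | while IFS= read -r path; do
        if ! printf '%s' "$path" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$path"
        fi
    done
}
```

Calling `list_invalid_utf8_names .` in the unpacked directory should list only the offending files, if there are any.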
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Tue Feb 19, 2019 4:42 pm

You're correct.
I just found out it's '\ufffd'.

I think I'll create something that renames those files with illegal characters...
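A minimal sketch of such a renamer could look like this. It is not what ended up being used in this thread; the function name `sanitize_names` is hypothetical, it replaces every byte outside printable ASCII with '_', and it does not check for name collisions or handle newlines in names.

```shell
# Rename regular files under directory $1 so that every byte outside
# printable ASCII in the file name becomes '_'.
sanitize_names() {
    find "$1" -depth -type f -print | while IFS= read -r path; do
        dir=$(dirname "$path")
        base=$(basename "$path")
        # -c complements the set: anything that is not printable ASCII is replaced.
        clean=$(printf '%s' "$base" | LC_ALL=C tr -c '[:print:]' '_')
        [ "$base" = "$clean" ] || mv -- "$path" "$dir/$clean"
    done
}
```

Running it on a copy of the files first is the safer choice, since two different broken names can collapse to the same clean name.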
Ant P.
Watchman


Joined: 18 Apr 2009
Posts: 6920

Posted: Tue Feb 19, 2019 6:48 pm

app-misc/detox
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Tue Feb 19, 2019 9:38 pm

Ant P. wrote:
app-misc/detox
Thanks. I think it beats perl-rename. :)
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Tue Feb 19, 2019 9:58 pm

I couldn't get -r working for detox. But with find ... -exec detox ... {} + it's perfectly usable.

Now I need to learn it a little more. Why haven't I heard of this before? :(
bunder
Bodhisattva


Joined: 10 Apr 2004
Posts: 5934

Posted: Wed Feb 20, 2019 2:15 am

Found this in a PDF about migrating from NFSv3 to NFSv4...

Quote:
Internationalization support; UTF-8
NFSv4 uses UTF-8 for file names, directories, symlinks and user and group identifiers. As UTF-8 is
backwards compatible with 7 bit encoded ASCII, any names that are 7 bit ASCII will continue to work.
However, pre-existing names that contain 8 bit characters will be misinterpreted by NFSv4 as UTF-8
multibyte characters, which may result in errors such as not finding files.

For example, an NFSv3 file created with the name René contains an 8 bit ASCII character in the last
position. NFSv4 will assume that the é indicates a multibyte UTF-8 encoding, which will lead to
unexpected results.

• Action: review existing NFSv3 names to ensure that they are 7 bit ASCII clean.


Maybe the filename encoding predates proper internationalization support.
_________________
Neddyseagoon wrote:
The problem with leaving is that you can only do it once and it reduces your influence.

banned from #gentoo since sept 2017
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Wed Feb 20, 2019 2:41 am

bunder wrote:
maybe the filename encoding predates proper internationalization support.
Well... It's a Doom2 wad... :D
All times are GMT