Gentoo Forums
PostgreSQL FTS: treat special char as punctuation/spaces

 
Gentoo Forums Forum Index -> Off the Wall
VinzC
Advocate


Joined: 17 Apr 2004
Posts: 4630
Location: Theux (Belgium)

Posted: Wed Sep 12, 2012 3:03 pm    Post subject: PostgreSQL FTS: treat special char as punctuation/spaces

Hi all.

Does anybody know how I can configure PostgreSQL to treat special characters (slashes, for instance) as punctuation or even spaces? The reason is that when indexed documents contain terms separated only by a [forward] slash, e.g. "term1/term2", a query for term1 or term2 alone doesn't return anything. One has to include both terms and the slash, i.e. "term1/term2", to get any result.

Thanks in advance for any hint or suggestion.

EDIT: Joking apart, this question is *not* ridiculous! The expression term1/term2 should *not* be treated as one single word, and that is exactly where the Postgres lexer fails: it treats term1/term2 as a file name occurrence, defeating full-text searches against either of these terms.

SELECT alias, description, token FROM ts_debug('english/spanish');
 alias |    description    |      token
-------+-------------------+-----------------
 file  | File or path name | english/spanish
(1 row)

So the goal is to prevent the lexer from interpreting such expressions as file names.
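For illustration, here is roughly what the mismatch looks like with the default english configuration (no tables involved, just literals):

-- the slash makes the whole expression a single "file" token, so a search
-- on either word alone finds nothing, while the exact expression matches
SELECT to_tsvector('english', 'documents in english/spanish')
       @@ plainto_tsquery('english', 'spanish');           -- returns false
SELECT to_tsvector('english', 'documents in english/spanish')
       @@ plainto_tsquery('english', 'english/spanish');   -- returns true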

Edit: split off the humorous reply at VinzC's request to preserve the "Unanswered Post" status. — JRG
_________________
Gentoo addict: tomorrow I quit, I promise!... Just one more emerge...
GNU/Linux user #369763
“Wow! I feel root”


Last edited by VinzC on Thu Sep 13, 2012 10:02 am; edited 1 time in total
jryan727
n00b


Joined: 04 Oct 2012
Posts: 0

Posted: Thu Oct 04, 2012 6:34 pm

Did you ever solve this? I just ran into the same issue.
VinzC
Advocate


Joined: 17 Apr 2004
Posts: 4630
Location: Theux (Belgium)

Posted: Fri Oct 05, 2012 8:03 am

jryan727 wrote:
Did you ever solve this? I just ran into the same issue.

Not yet. I had a talk with the developers on IRC; one of them was about to publish a tutorial on modifying how the lexer outputs terms for indexing. I just haven't had time to check back.
_________________
Gentoo addict: tomorrow I quit, I promise!... Just one more emerge...
GNU/Linux user #369763
“Wow! I feel root”
Bones McCracker
Veteran


Joined: 14 Mar 2006
Posts: 1553
Location: U.S.A.

Posted: Fri Oct 05, 2012 9:15 am

Your problem may be data modeling: why are you entering two items into a single record? If the things in your original table support multiple languages, then maybe you need a "languages" table and a "supported languages" join table pairing foreign keys to your original table and your languages table.
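Something along these lines, for instance (table and column names are only placeholders; I'm assuming your existing table is called "documents"):

-- hypothetical sketch of the normalized layout
CREATE TABLE languages (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE supported_languages (
    document_id integer NOT NULL REFERENCES documents(id),
    language_id integer NOT NULL REFERENCES languages(id),
    PRIMARY KEY (document_id, language_id)
);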
VinzC
Advocate


Joined: 17 Apr 2004
Posts: 4630
Location: Theux (Belgium)

Posted: Fri Oct 05, 2012 1:13 pm

BoneKracker wrote:
Your problem may be data modeling: why are you entering two items into a single record? If the things in your original table support multiple languages, then maybe you need a "languages" table and a "supported languages" join table pairing foreign keys to your original table and your languages table.

It's not about languages; it's about how the Postgres lexer treats characters that belong to punctuation. Currently occurrences of terms such as term1/term2 are interpreted as "file" or "URI" items, just like "proto://host/path". The thing is, it's not obvious what to do when you don't need URI/file/path recognition [EDIT: and want to be able to query each term in the expression independently].
_________________
Gentoo addict: tomorrow I quit, I promise!... Just one more emerge...
GNU/Linux user #369763
“Wow! I feel root”
Bones McCracker
Veteran


Joined: 14 Mar 2006
Posts: 1553
Location: U.S.A.

Posted: Sat Oct 06, 2012 6:21 am

VinzC wrote:
BoneKracker wrote:
Your problem may be data modeling: why are you entering two items into a single record? If the things in your original table support multiple languages, then maybe you need a "languages" table and a "supported languages" join table pairing foreign keys to your original table and your languages table.

It's not about languages; it's about how the Postgres lexer treats characters that belong to punctuation. Currently occurrences of terms such as term1/term2 are interpreted as "file" or "URI" items, just like "proto://host/path". The thing is, it's not obvious what to do when you don't need URI/file/path recognition [EDIT: and want to be able to query each term in the expression independently].

I understand what you're saying, but if you are trying to query each term in the expression independently, it may well be a data model problem and it's worth taking a moment to think about that when something like this comes up.

Assuming that's not applicable, have you tried using a pattern-matching query? There is the usual LIKE operator, which provides basic shell-like pattern matching; there is SIMILAR TO, which uses SQL99 regular expressions; and there are full-blown POSIX regular expressions via the ~ operator (or SUBSTRING).
http://www.postgresql.org/docs/7.3/static/functions-matching.html
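For example (assuming a hypothetical "documents" table with a "body" text column):

SELECT id FROM documents WHERE body LIKE '%english%';                   -- shell-style wildcards
SELECT id FROM documents WHERE body SIMILAR TO '%(english|spanish)%';   -- SQL99 regular expression
SELECT id FROM documents WHERE body ~* 'english';                       -- POSIX regex, case-insensitive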
VinzC
Advocate


Joined: 17 Apr 2004
Posts: 4630
Location: Theux (Belgium)

Posted: Sat Oct 06, 2012 10:53 am

BoneKracker wrote:
I understand what you're saying, but if you are trying to query each term in the expression independently, it may well be a data model problem and it's worth taking a moment to think about that when something like this comes up.

Indexed documents are PDF files, from which I extract the text and put the result into a text field. These documents therefore happen to contain such constructions here and there. There's not much that can be changed about that.

BoneKracker wrote:
Assuming that's not applicable, have you tried using a pattern-matching query? There is the usual LIKE operator, which provides basic shell-like pattern matching; there is SIMILAR TO, which uses SQL99 regular expressions; and there are full-blown POSIX regular expressions via the ~ operator (or SUBSTRING).
http://www.postgresql.org/docs/7.3/static/functions-matching.html

That is inapplicable for at least two reasons. First, searches are made via a web application with a form that has only one field that may contain many terms, à la Google if you like; the input is then passed unchanged to to_tsquery() or plainto_tsquery() through a stored procedure (or function). It would require a lot of code to change this behaviour, and I don't even know if it's possible. Second, LIKE would result in a non-negligible performance hit: full-text indexing is still much faster than calls to string functions.
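To give an idea, the search function is roughly of this shape (a simplified sketch with made-up names — a "documents" table with a "textsearch" tsvector column — not the actual code):

-- hypothetical: one free-text field from the web form, handed as-is to
-- plainto_tsquery() and matched against a pre-built tsvector column
CREATE OR REPLACE FUNCTION search_documents(query text)
RETURNS SETOF documents AS $$
    SELECT d.*
    FROM documents d
    WHERE d.textsearch @@ plainto_tsquery('english', query)
    ORDER BY ts_rank(d.textsearch, plainto_tsquery('english', query)) DESC;
$$ LANGUAGE sql STABLE;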

I've talked to one of the developers on IRC and the issue was acknowledged, though. It's just not obvious how to change the lexer's behaviour when it encounters terms separated by a slash. In short, it would be more than interesting if one could disable some of the lexer's behaviours (token types, I think they call them) individually, like this one: I need no file-name parsing here, period. I can't be the only one who needs this.

The only [easy] workaround I've found is to blindly replace slashes with spaces, or with whatever symbol the Postgres FTS lexer ignores. But that would also change the samples of the original text that contain the matched terms.
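In SQL terms the workaround would boil down to something like this (again with made-up names):

-- strip slashes only when building the tsvector; the stored body stays intact,
-- but any text fed to ts_headline() for excerpts would need the same replacement,
-- which is exactly the drawback mentioned above
UPDATE documents
SET textsearch = to_tsvector('english', replace(body, '/', ' '));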
_________________
Gentoo addict: tomorrow I quit, I promise!... Just one more emerge...
GNU/Linux user #369763
“Wow! I feel root”
petrjanda
Veteran


Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sat Oct 06, 2012 11:00 am

Why not just adjust your program to insert a couple of spaces around the slash, such as A[space]/[space]B?
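Something like this, off the top of my head (table/column names are just examples):

-- pad every slash with spaces so the lexer sees two separate words
-- instead of one "file" token
UPDATE documents
SET body = replace(body, '/', ' / ');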

Petr
_________________
There is, a not-born, a not-become, a not-made, a not-compounded. If that unborn, not-become, not-made, not-compounded were not, there would be no escape from this here that is born, become, made and compounded. - Gautama Siddharta
VinzC
Advocate


Joined: 17 Apr 2004
Posts: 4630
Location: Theux (Belgium)

Posted: Sat Oct 06, 2012 11:30 am

petrjanda wrote:
Why not just adjust your program to insert a couple of spaces around the slash, such as A[space]/[space]B?

Petr

Clever! Thanks a lot for your suggestion. Indeed, that's the best immediate workaround. I'd certainly prefer not to touch the base data, but this is pretty elegant. Thanks again.
_________________
Gentoo addict: tomorrow I quit, I promise!... Just one more emerge...
GNU/Linux user #369763
“Wow! I feel root”
petrjanda
Veteran


Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sat Oct 06, 2012 12:34 pm

VinzC wrote:
petrjanda wrote:
Why not just adjust your program to insert a couple of spaces around the slash, such as A[space]/[space]B?

Petr

Clever! Thanks a lot for your suggestion. Indeed, that's the best immediate workaround. I'd certainly prefer not to touch the base data, but this is pretty elegant. Thanks again.


See, pretty much all software (at least all open-source software) leaves something to be desired. As programmers we almost always need to work around issues in specific third-party libraries. Not that long ago I wrote a high-level C++ PDF library based around libharu (as it's really fast), only to find out that it's completely un-MP-safe. I ended up having to enclose some libharu routines in a mutex lock, which slowed it down significantly, and unfortunately I ran out of time to fix these issues inside libHaru. (I suppose they use static variables for speed and ignored MP safety.)
_________________
There is, a not-born, a not-become, a not-made, a not-compounded. If that unborn, not-become, not-made, not-compounded were not, there would be no escape from this here that is born, become, made and compounded. - Gautama Siddharta