
rac wrote: Here's how the phpBB search function works. Each post is split into words. First, some characters are replaced. There are three classes of characters here: those that get replaced by spaces, those that get elided, and those that get left alone. Next, whitespace is used to delineate words. All words of fewer than 3 or more than 20 characters are dropped. Then an entry is made in the dictionary table for every word that is not already in the dictionary, so that it can be referenced by number. An entry is made in a colossal table for each and every word in each and every post. That's what gets searched against.
To get back to Reformist's two examples, gnome2 is a word. 'gnome 2' is two words, one of which is impossible to match because it is one character long. "1.1.0" is three words, each of which is impossible to match because it is one character long. One modification that it might be feasible to make would be to change the status of '.'. If it were left alone, version numbers would become searchable. However, words at the end of sentences, followed by periods, would become unsearchable, because a separate entry would be made including the period. If it were elided, the end-of-sentence problem would go away, but then you would have to search for "abiword and 110", and "2.1" would become "21" and fall under the three-character limit.
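The word-splitting rules described above can be sketched in a few lines of Python. This is only an illustration, not phpBB's code: the character lists `TO_SPACE` and `ELIDED` are assumptions (the real lists are longer), and the `period` parameter is a hypothetical switch for trying the three possible treatments of '.'.

```python
# Assumed character classes; the real phpBB lists are not reproduced here.
TO_SPACE = ".,;:!?()/"   # replaced by a space, so they break words
ELIDED = "'\""            # removed outright, joining their neighbours
MIN_LEN, MAX_LEN = 3, 20  # length limits stated in the post


def index_words(text, period="space"):
    """Return the words that would be indexed for `text`.

    period: "space" (current behaviour), "keep" (left alone) or "elide".
    """
    to_space = TO_SPACE if period == "space" else TO_SPACE.replace(".", "")
    elided = ELIDED + "." if period == "elide" else ELIDED
    table = str.maketrans(
        {**{c: " " for c in to_space}, **{c: None for c in elided}}
    )
    words = text.lower().translate(table).split()
    # Words shorter than 3 or longer than 20 characters are dropped.
    return [w for w in words if MIN_LEN <= len(w) <= MAX_LEN]


print(index_words("gnome 2"))                        # ['gnome']
print(index_words("abiword 1.1.0"))                  # ['abiword']
print(index_words("abiword 1.1.0", period="keep"))   # ['abiword', '1.1.0']
print(index_words("It broke.", period="keep"))       # ['broke.']
print(index_words("abiword 1.1.0", period="elide"))  # ['abiword', '110']
```

With the period left alone, "broke." is indexed with its trailing period, which is the end-of-sentence problem; with the period elided, "2.1" becomes the two-character "21" and is dropped.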
port001 wrote: would be cool if we could do the old "kde3.1 emerge failure" note the "" that makes sure it matches the whole string. like on google
That, while a nice feature, is completely impossible with the current way the search databases work, because the search match tables have fields for word number and post id only. There is no sense of what words occur next to one another.
port001 wrote: would be cool if we could do the old "kde3.1 emerge failure" note the "" that makes sure it matches the whole string. like on google
rac wrote: That, while a nice feature, is completely impossible with the current way the search databases work, because the search match tables have fields for word number and post id only. There is no sense of what words occur next to one another.
It _is_ possible, by making PHP look into the actual post texts. Would be a query like:
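As a rough illustration of the idea of checking the actual post text (this is not the poster's query, and it is Python rather than SQL), exact-phrase matching amounts to a substring test on each post. The `posts` dictionary and the `phrase_search` function are hypothetical names made up for this sketch.

```python
# Hypothetical sample data: post id -> post text.
posts = {
    1: "Got a kde3.1 emerge failure after syncing.",
    2: "The emerge of kde3.1 went fine, no failure here.",
}


def phrase_search(phrase, posts):
    """Return ids of posts whose text contains the exact phrase."""
    needle = phrase.lower()
    return sorted(
        post_id for post_id, text in posts.items() if needle in text.lower()
    )


print(phrase_search("kde3.1 emerge failure", posts))  # [1]
```

Post 2 contains all three words but not next to one another, so it is not returned. In SQL the same test would typically be a `LIKE` pattern with wildcards on both sides, which has to scan the post text instead of using the word index.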
rac wrote: Join all your terms with 'and' or check "search for all terms", otherwise the default is "or", which is probably not what you want.
xlyz wrote: please make "AND" default. "OR" is seldom used
gsfgf wrote: at least for the quick search.
Indeed, I'd like that too, would be a great improvement.
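The difference between the two defaults is easy to see on a toy version of the match table. A minimal sketch, assuming a dictionary that maps words to numbers and a match table of (word number, post id) pairs as described earlier in the thread; all names and data here are made up.

```python
# Hypothetical dictionary table: word -> word number.
dictionary = {"emerge": 1, "failure": 2, "kde": 3}

# Hypothetical match table: (word number, post id) pairs.
matches = {(1, 10), (2, 10), (1, 11), (3, 12)}


def search(terms, mode="or"):
    """Return post ids matching any term ("or") or every term ("and")."""
    per_term = []
    for term in terms:
        word_id = dictionary.get(term.lower())
        per_term.append({post for wid, post in matches if wid == word_id})
    if not per_term:
        return []
    if mode == "and":
        combined = set.intersection(*per_term)
    else:
        combined = set.union(*per_term)
    return sorted(combined)


print(search(["emerge", "failure"], mode="or"))   # [10, 11]
print(search(["emerge", "failure"], mode="and"))  # [10]
```

With "or", every post mentioning either word comes back; with "and", only the post that contains both.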

gsfgf wrote: If you convert 2.2.2 to 222 and make search strip periods as well so if you search for kde 3.1 search will treat it as kde 31, that would solve that issue. That may be harder than it looks, though.
It is harder than it looks, for a couple of reasons. Either a period causes a word break or it doesn't. Now what would be best is if it caused a word break only if it wasn't a version number, but that could be a challenging regex. Maybe we could steal it from Portage. I think if we're going to go this far, we might as well get it right and have "version numbers" go into the index.
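One simple approximation of "a period breaks a word unless it is part of a version number" is to keep a period only when it sits between two digits. The sketch below is an assumption-laden illustration, not Portage's version pattern and not phpBB code; `split_words` is a hypothetical helper.

```python
import re

# A period is a word break unless it has a digit on both sides.
PERIOD_BREAK = re.compile(r"(?<!\d)\.|\.(?!\d)")


def split_words(text):
    """Split on whitespace, treating non-version periods as spaces."""
    return PERIOD_BREAK.sub(" ", text).lower().split()


print(split_words("kde 3.1 failed. Retry."))
# ['kde', '3.1', 'failed', 'retry']
print(split_words("abiword 1.1.0 works."))
# ['abiword', '1.1.0', 'works']
```

This keeps "3.1" and "1.1.0" intact while still dropping sentence-ending periods. It would also keep periods in things like decimal numbers or IP addresses, and it knows nothing about version suffixes, which is part of why a full solution is harder than it looks.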

Lion wrote: So, my question is: what am I doing wrong?
Being unlucky. To help keep the size of the search tables down, there is a "stopword list" in phpBB's search function. Words on the stopword list are not indexed because they are too common. Unfortunately for your example, world is on the stopword list, so nothing shows up, and then when you search for "world and file", you are really only searching for 'file'.
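The effect on the "world and file" search can be sketched as follows. The `STOPWORDS` set is a tiny assumed sample ("world" is the only entry taken from this thread), and `strip_stopwords` is a hypothetical helper, not a phpBB function.

```python
# Assumed sample of the stopword list; only "world" comes from the thread.
STOPWORDS = {"the", "world"}


def strip_stopwords(terms):
    """Split search terms into those actually searched and those dropped."""
    kept = [t for t in terms if t.lower() not in STOPWORDS]
    dropped = [t for t in terms if t.lower() in STOPWORDS]
    return kept, dropped


kept, dropped = strip_stopwords(["world", "file"])
print(kept)     # ['file']
print(dropped)  # ['world']
```

Returning the dropped terms separately means the search page could at least tell the user which words were ignored, instead of silently searching for 'file' alone.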
Lion wrote: So, my question is: what am I doing wrong?
rac wrote: Being unlucky. To help keep the size of the search tables down, there is a "stopword list" in phpBB's search function.
what are the words included in the list?
xlyz wrote: what are the words included in the list?
http://forums.gentoo.org/language/lang_ ... pwords.txt
xlyz wrote: what are the words included in the list?
rac wrote: http://forums.gentoo.org/language/lang_ ... pwords.txt
I assume that gentoo bugzilla has a similar stopword list? It certainly would explain some of my difficulties in using search there.
dufeu wrote: I assume that gentoo bugzilla has a similar stopword list?
Bugzilla's completely different software. I don't know off the top of my head whether there's a stopword list. If I get some time I may look into it further.

Lion wrote: So, my question is: what am I doing wrong?
rac wrote: Being unlucky. To help keep the size of the search tables down, there is a "stopword list" in phpBB's search function. Words on the stopword list are not indexed because they are too common. Unfortunately for your example, world is on the stopword list, so nothing shows up, and then when you search for "world and file", you are really only searching for 'file'.
Perhaps a lesson could be learned from Google. Stopwords are identified if they are included in a search, viz:
"the" is a very common word and was not included in your search