23 February 2007 9:06 PM
Stop-words (aka noise keywords)

Stop-words (aka noise keywords) are words that generally carry little meaning on their own and should not be part of a search engine query. Everyone should create their own list, since different websites will need different lists. I used the list found on this website to get me started.

In case the link doesn't work, here are most of the keywords they use:

a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fifty, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, nothing, now, nowhere, of, off, often, on, once, one, only, onto, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves
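
To show how a list like this might be applied, here is a minimal sketch in Python (chosen only for illustration; the same logic should port to ColdFusion). The function name strip_stop_words and the short STOP_WORDS subset are my own assumptions, not part of any search engine API, and a real site would load its own full list.

```python
import re

# Illustrative subset of the list above; a real site would use its own full list.
STOP_WORDS = {"a", "about", "the", "of", "in", "to", "for", "is", "with"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the lower-cased query terms that are not stop-words, in order.

    Terms are split on whitespace and commas. This is a hypothetical helper,
    not something provided by Verity or ColdFusion.
    """
    terms = re.findall(r"[^\s,]+", query.lower())
    return [term for term in terms if term not in stop_words]


if __name__ == "__main__":
    print(strip_stop_words("The history of ColdFusion in a nutshell"))
```

If the returned list is empty, the query consisted only of stop-words and probably should not be sent to the search engine at all.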

23 February 2007 1:38 PM
ColdFusion and Verity spider utility (vspider.exe)

Recently I looked yet again at the spider included with Verity search in ColdFusion 6.1 and ColdFusion 7. I tried to use this software some time ago but never finished. Now was the time to give it another push.

The main advantage of vspider over a regular collection created by Verity with cfindex is that you actually crawl the site the 'Google' way, which minimizes the chance of any 'code' showing up in the search results. Another BIG plus is the inclusion of dynamic content (i.e. pages with ?a=b in the URL). Here are some of the main points about the search:

  • Performance depends on the collection size and the number of records returned. A search can take as little as 30ms for a few results and as long as 300ms for 100 results
  • The search doesn't filter out 'noise' keywords such as 'a' and 'the'. It also doesn't filter blanks (i.e. a search for nothing returns all documents)
  • The search supports wildcards '*' and '?'
  • Supports 'and', 'or' and 'not', but you need to catch errors for users searching for just 'and' etc.
  • By default CSS content is indexed, so you need to exclude it
  • The search can suggest keywords; the suggestions are based on what is in the index
  • The 'score' returned for each result is 'questionable', i.e. I am not sure whether it is very useful
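
Two of the points above (blank searches returning everything, and errors when a user searches for just 'and') can be guarded against before the query ever reaches Verity. The following is a rough sketch in Python, used here only for illustration; is_searchable and OPERATORS are hypothetical names of my own, and the check covers only the cases described above.

```python
# Operator words mentioned above; a query made up only of these is rejected.
OPERATORS = {"and", "or", "not"}


def is_searchable(query):
    """Return True if the query is worth passing to the search engine.

    Rejects blank queries (which would otherwise return all documents) and
    queries that contain nothing but operator words such as 'and'.
    Hypothetical helper, not part of Verity or ColdFusion.
    """
    terms = query.lower().split()
    if not terms:
        return False
    return any(term not in OPERATORS for term in terms)


if __name__ == "__main__":
    for sample in ["", "and", "cats and dogs"]:
        print(repr(sample), is_searchable(sample))
```

This does not replace error handling around the search call itself; it only filters out the inputs that are known to cause trouble.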

Important note: it looks like the sites you index and search have to be on 'localhost'. I could not get other URLs to work, so you may need to configure Apache or IIS to serve your website on localhost. Something like 'http://localhost/tomkitta/' is fine.

A simple indexing command can be:

vspider -style D:\CFusionMX7\verity\Data\stylesets\ColdFusionVspider -collection D:\CFusionMX7\verity\collections\tomkitta -start http://localhost/tomkitta/ -cgiok

The command for ColdFusion 6.1 uses a slightly different path (it uses spider version 3.7, while CF7 uses spider version 5):

vspider -collection D:\CFusionMX\verity\collections\sony -start http://localhost/tomkitta/ -cgiok
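
If you re-index regularly, it may help to build the command from parameters instead of retyping it. Below is a small sketch in Python, assuming only the options shown in the commands above (-style, -collection, -start, -cgiok). The function build_vspider_args is a hypothetical helper of mine; it only assembles the argument list and does not run anything.

```python
def build_vspider_args(collection, start_url, style=None, cgiok=True):
    """Assemble a vspider argument list from the options used above.

    Only -style, -collection, -start and -cgiok are handled, because those
    are the only options shown in the commands above. Hypothetical helper.
    """
    args = ["vspider"]
    if style:
        # The CF7 command passes a styleset; the CF 6.1 command above does not.
        args += ["-style", style]
    args += ["-collection", collection, "-start", start_url]
    if cgiok:
        args.append("-cgiok")
    return args


if __name__ == "__main__":
    cf7_args = build_vspider_args(
        collection=r"D:\CFusionMX7\verity\collections\tomkitta",
        start_url="http://localhost/tomkitta/",
        style=r"D:\CFusionMX7\verity\Data\stylesets\ColdFusionVspider",
    )
    print(" ".join(cf7_args))
```

The resulting list could be handed to something like Python's subprocess.run on a machine where vspider is on the PATH; that step is left out here because it depends on your installation.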

Some vspider.exe - Verity ColdFusion spider tutorials and resources

