This document describes the present use of stopwords in WilsonWeb NG.
Presently, the system does not treat any word as a stopword for indexing purposes. That is, all words are effectively searchable.
Some words are removed from queries when the <freetext> operator is used. The table of these words appears at the end of this document. All stopwords are stripped from a <freetext> query unless the entire query consists of stopwords, in which case all of the words are searched.
Examples:
| User types | What is searched |
| <freetext>"making" | Making |
| <freetext>"making cookies" | Cookies |
| <freetext>"peace negotiations in the middle east" | "peace negotiations middle east" |
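
The stripping rule shown in the examples above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the search engine's code: the function name `strip_stopwords` is hypothetical, the stopword set is a small subset of the full table, and whitespace tokenization with case-insensitive matching is an assumption.

```python
# Illustrative subset of the stopword table at the end of this document.
STOPWORDS = {"a", "in", "the", "making", "h", "w", "of", "thing"}

def strip_stopwords(query, stopwords=STOPWORDS):
    """Return the terms that would be searched for a <freetext>-style query."""
    # Assumption: terms are separated by whitespace and compared in lowercase.
    terms = query.lower().split()
    kept = [term for term in terms if term not in stopwords]
    # If every term is a stopword, fall back to searching all of them.
    return kept if kept else terms

print(strip_stopwords("making"))          # ['making']
print(strip_stopwords("making cookies"))  # ['cookies']
print(strip_stopwords("peace negotiations in the middle east"))
# ['peace', 'negotiations', 'middle', 'east']
```

The fallback on the last line of the function is what keeps a query such as "making" searchable even though every word in it is on the list.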
The <freetext> operator is used only in Basic Search when the Natural Language choice is selected.
For any other operator, stopwords are searchable. So if you search "thing <in> su", you will receive documents where the word ‘thing’ occurs as a word in the subject.

Side Effects
In operations where stopwords are retained, such as an Advanced Search for "the fly" as "title", you will get only those records that have both words. A non-rules-based field search in Advanced Search uses the <near> operator and does not strip stopwords, so you will not get records that have only one of the words in the title.

Example searches done in OmniFile FT Mega
| Search | As | Hits | Why |
| H W Wilson Co Inc | Subject(s) | 45 | Each is a 100% match on subject |
| The H W Wilson Co Inc | Subject(s) | 0 | Headings do not have "The" |
| H W Wilson Co Inc | All | 566,429 | 100% hits in SU, and many other free text hits |
| The H W Wilson Co Inc | All | 566,429 | <freetext> strips the "the," "h," and "w," but <near> does not, so you get no 100% ranking hits. |
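
The contrast between the two Subject(s) searches above can be modeled with a small Python sketch. It is a deliberate simplification: `field_search_matches` is a hypothetical stand-in that only captures the "every word must be present" consequence described in this section, and it ignores proximity and scoring entirely. The stopword set is a three-word subset of the full table.

```python
# Subset of the stopword table in this document (assumption: case-insensitive matching).
STOPWORDS = {"the", "h", "w"}

def field_search_matches(query, field_value):
    """Hypothetical field search: every query word must occur in the field."""
    field_words = set(field_value.lower().split())
    return all(word in field_words for word in query.lower().split())

def freetext_terms(query):
    """Terms left after <freetext>-style stripping (all kept if all are stopwords)."""
    terms = query.lower().split()
    kept = [term for term in terms if term not in STOPWORDS]
    return kept or terms

subject_heading = "H W Wilson Co Inc"
print(field_search_matches("H W Wilson Co Inc", subject_heading))      # True
print(field_search_matches("The H W Wilson Co Inc", subject_heading))  # False
print(freetext_terms("The H W Wilson Co Inc"))  # ['wilson', 'co', 'inc']
```

Because the field search keeps "the" as a required word, the heading without "The" fails to match, while the stripped query no longer contains the words that caused the mismatch.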

Relevance Ranking
When the ranking is left to Verity, as it is for all queries other than rules-based queries, stopwords influence the rank. The ranking algorithm reduces the contribution of words that the search engine considers omnipresent. So while the search <accrue>(rocky,the,flying,squirrel) may retrieve documents with dozens of occurrences of the word "the", including some with many occurrences of "the" and none of the other words, the highest-ranking records will be those that have all four of the words.

Highlighting
Whether or not a stopword is removed from the query, it is still included in the highlighting operation on the full text.
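
To make the Relevance Ranking behavior described earlier more concrete, here is a minimal Python sketch. It does not reproduce Verity's ranking algorithm, which this document does not specify. It substitutes a generic TF-IDF-style weighting as an assumption, so that a word found in every document contributes little to the score. The four sample documents and the function name `accrue_scores` are invented for illustration.

```python
import math

# Invented sample documents; "the" occurs in all of them.
DOCS = {
    "d1": "the the the the the the end",
    "d2": "rocky the flying squirrel",
    "d3": "the flying circus",
    "d4": "the rocky road",
}

def accrue_scores(terms, docs=DOCS):
    """Score each document by summing a weight for every query term it contains."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    total_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        score = 0.0
        for term in terms:
            term_count = words.count(term)
            if term_count == 0:
                continue
            # Number of documents containing the term: higher means more "omnipresent".
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            # Assumption: IDF-style weight, so common words count for less.
            weight = math.log(1 + total_docs / doc_freq)
            # Dampen repeated occurrences so many copies of one word add little.
            score += (1 + math.log(term_count)) * weight
        scores[doc_id] = score
    return scores

scores = accrue_scores(["rocky", "the", "flying", "squirrel"])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # d2
```

Under this weighting, the document that repeats "the" six times still scores below the one that contains all four query words, which is the behavior the Relevance Ranking section describes.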

Stopwords Recognized by the <freetext> Operator
The list used by the search engine is taken from a research paper:
Fox, Christopher, "A Stop List for General Text", SIGIR Forum, v 24, n 1-2, Fall 89/Winter 90, p 19-35
| A | both | few | important | much | parted | since | under |
| about | but | find | in | must | parting | small | until |
| above | by | finds | interest | my | parts | smaller | up |
| across | C | first | interested | myself | per | smallest | upon |
| after | came | for | interesting | N | perhaps | so | us |
| again | can | four | interests | necessary | place | some | use |
| against | cannot | from | into | need | places | somebody | uses |
| all | case | full | is | needed | point | someone | used |
| almost | cases | fully | it | needing | pointed | something | V |
| alone | certain | further | its | needs | pointing | somewhere | very |
| along | certainly | furthered | itself | never | points | state | W |
| already | clear | furthering | J | new | possible | states | want |
| also | clearly | furthers | just | newer | present | still | wanted |
| although | come | G | K | newest | presented | such | wanting |
| always | could | gave | keep | next | presenting | sure | wants |
| among | D | general | keeps | no | presents | T | was |
| an | did | generally | kind | non | problem | take | way |
| and | differ | get | knew | not | problems | taken | ways |
| another | different | gets | know | nobody | put | than | we |
| any | differently | give | known | noone | puts | that | well |
| anybody | do | given | knows | nothing | Q | the | wells |
| anyone | does | gives | L | now | quite | their | went |
| anything | done | go | large | nowhere | R | them | were |
| anywhere | down | going | largely | number | rather | then | what |
| are | downed | good | last | numbers | really | there | when |
| area | downing | goods | later | O | right | therefore | where |
| areas | downs | got | latest | of | room | these | whether |
| around | during | great | least | off | rooms | they | which |
| as | E | greater | less | often | S | thing | while |
| ask | each | greatest | let | old | said | things | who |
| at | early | group | lets | older | same | think | whole |
| away | either | grouping | like | oldest | saw | thinks | whose |
| B | end | groups | likely | on | say | this | why |
| back | ended | H | long | once | says | those | will |
| backed | ending | had | longer | one | second | though | with |
| backing | ends | has | longest | only | seconds | thought | within |
| backs | enough | have | M | open | see | thoughts | without |
| be | even | having | made | opened | sees | three | work |
| because | evenly | he | make | opening | seem | through | worked |
| become | ever | her | making | opens | seemed | thus | working |
| becomes | every | herself | man | or | seeming | to | works |
| became | everybody | here | many | order | seems | today | would |
| been | everyone | high | me | ordered | several | together | Y |
| before | everything | higher | member | ordering | shall | too | year |
| began | everywhere | highest | members | orders | she | took | years |
| behind | F | him | men | other | should | toward | yet |
| being | face | himself | might | others | show | turn | you |
| beings | faces | his | more | our | showed | turned | young |
| best | fact | how | most | out | showing | turning | younger |
| better | facts | however | mostly | over | shows | turns | youngest |
| between | far | I | mr | P | side | two | your |
| big | felt | if | mrs | part | sides | U | yours |