The H.W. Wilson Company - New York, Dublin
 
 
 

  Stopwords in WilsonWeb

   
 

This document describes the present use of stopwords in WilsonWeb NG.
At present, the system does not treat any word as a stopword for indexing purposes. That is, all words are effectively searchable.

Some words are removed from queries when the <freetext> operator is used; the full table of these stopwords appears at the end of this document. All stopwords are stripped from a <freetext> query unless the query consists entirely of stopwords, in which case all of its words are searched.

Examples:

User types                                            What is searched
<freetext>"making"                                    Making
<freetext>"making cookies"                            Cookies
<freetext>"peace negotiations in the middle east"     "peace negotiations middle east"

 

The <freetext> operator is used only in Basic Search when the Natural Language choice is selected.
For any other operator, stopwords are searchable. For example, if you search "thing <in> su", you will receive documents in which the word "thing" occurs in the subject, even though "thing" is on the stopword list.
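The stripping rule described above for <freetext> queries can be illustrated with a minimal Python sketch. This is not the search engine's code: the function name strip_stopwords and the small STOPWORDS excerpt are assumptions made for illustration, and the real list is the full table at the end of this document.

```python
# Small excerpt of the stopword table; the real list is much longer.
STOPWORDS = {"a", "in", "the", "making", "thing", "h", "w"}


def strip_stopwords(query, stopwords=STOPWORDS):
    """Drop stopwords from a free-text query, unless nothing would remain."""
    words = query.split()
    kept = [w for w in words if w.lower() not in stopwords]
    # A query made up entirely of stopwords is searched as typed.
    return kept if kept else words


for q in ("making", "making cookies", "peace negotiations in the middle east"):
    print(q, "->", " ".join(strip_stopwords(q)))
```

Run against the three example queries, the sketch keeps "making" when it is the whole query, drops it from "making cookies", and drops "in" and "the" from the third query.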

Side Effects

Where stopwords are not stripped, they must match like any other word. For example, if you are in Advanced Search and search "the fly" as "title", you will get only those records that have both words. A non-rules-based field search in Advanced Search uses the <near> operator, and the stopword is not stripped out, so you will not get records that have only one of the words in the title.
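The effect can be pictured with a toy model in Python. It is only a rough sketch: it assumes a field search simply requires every query word to be present in the field, and it ignores the proximity scoring that <near> also performs. The function name field_search and the sample titles are hypothetical.

```python
def field_search(query, field_values):
    """Return the field values that contain every query word, stopwords included."""
    wanted = [w.lower() for w in query.split()]
    hits = []
    for value in field_values:
        present = {w.lower() for w in value.split()}
        # No stopword stripping here: "the" must match like any other word.
        if all(w in present for w in wanted):
            hits.append(value)
    return hits


# Hypothetical titles, for illustration only.
titles = ["The Fly", "Fly Fishing Basics"]
print(field_search("the fly", titles))
print(field_search("fly", titles))
```

In this model, "the fly" matches only the first title, while "fly" alone matches both, because the stopword is treated as a required word.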

Example searches done in OmniFile FT Mega

Search                   As           Hits       Why
H W Wilson Co Inc        Subject(s)   45         Each is a 100% match on subject
The H W Wilson Co Inc    Subject(s)   0          Headings do not have "The"
H W Wilson Co Inc        All          566,429    100% hits in SU, and many other free text hits
The H W Wilson Co Inc    All          566,429    <freetext> strips the "the," "h," and "w," but <near> does not, so you get no 100% ranking hits.

 

Relevance Ranking

When the ranking is left to Verity, as it is for all queries other than rules-based queries, stopwords will influence the rank. The ranking algorithm reduces the scoring influence of words that the search engine considers omnipresent. So although the search <accrue>(rocky,the,flying,squirrel) may retrieve documents with dozens of occurrences of the word "the", including some that have many occurrences of "the" and none of the other words, the highest-ranking records will be those that contain all four words.
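Verity's actual ranking algorithm is not described here, so the following Python sketch is only an analogy. It assumes a simple weighting in which a term found in most documents contributes less to the score, which is one common way to damp omnipresent words. The function name rank, the weighting formula, and the three sample documents are all hypothetical.

```python
import math


def rank(query_terms, docs):
    """Order document ids by a toy score that damps very common terms."""
    n_docs = len(docs)
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    def weight(term):
        # Terms that occur in many documents get a smaller weight.
        df = sum(1 for words in tokenized.values() if term in words)
        return math.log(1 + n_docs / df) if df else 0.0

    scores = {
        doc_id: sum(weight(t) for t in query_terms if t in words)
        for doc_id, words in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical documents, for illustration only.
docs = {
    "all_four": "the rocky flying squirrel and the moose",
    "only_the": "the the the the the cat sat on the mat",
    "the_flying": "the flying fish",
}
ranked = rank(["rocky", "the", "flying", "squirrel"], docs)
print(ranked)
```

In this toy model the document containing all four words ranks first, and the document that contains only "the" ranks last, however many times "the" occurs in it.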

Highlighting

Whether or not a word is removed from the query, it is still included in the highlighting operation on the full text.
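As a rough illustration, the Python sketch below highlights from the words the user typed rather than from the stripped query. The function name highlight and the [[ ]] markers are hypothetical and do not reflect how WilsonWeb marks up its full text.

```python
import re


def highlight(text, query_words, open_mark="[[", close_mark="]]"):
    """Wrap every whole-word match of any query word in marker strings."""
    if not query_words:
        return text
    alternatives = "|".join(re.escape(w) for w in query_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, text)


# The original query words are used, so the stopword "making" is still marked.
print(highlight("Making cookies for the holidays", ["making", "cookies"]))
```

Here "making" would be stripped from the <freetext> query itself, yet it is still marked in the text because highlighting uses the words as typed.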

Stopwords Recognized by the <freetext> Operator

The list used by the search engine is taken from a research paper:

Fox, Christopher. "A Stop List for General Text." SIGIR Forum, vol. 24, no. 1-2, Fall 1989/Winter 1990, pp. 19-35.

A         both         few         important    much       parted      since      under
about     but          find        in           must       parting     small      until
above     by           finds       interest     my         parts       smaller    up
across    C            first       interested   myself     per         smallest   upon
after     came         for         interesting  N          perhaps     so         us
again     can          four        interests    necessary  place       some       use
against   cannot       from        into         need       places      somebody   uses
all       case         full        is           needed     point       someone    used
almost    cases        fully       it           needing    pointed     something  V
alone     certain      further     its          needs      pointing    somewhere  very
along     certainly    furthered   itself       never      points      state      W
already   clear        furthering  J            new        possible    states     want
also      clearly      furthers    just         newer      present     still      wanted
although  come         G           K            newest     presented   such       wanting
always    could        gave        keep         next       presenting  sure       wants
among     D            general     keeps        no         presents    T          was
an        did          generally   kind         non        problem     take       way
and       differ       get         knew         not        problems    taken      ways
another   different    gets        know         nobody     put         than       we
any       differently  give        known        noone      puts        that       well
anybody   do           given       knows        nothing    Q           the        wells
anyone    does         gives       L            now        quite       their      went
anything  done         go          large        nowhere    R           them       were
anywhere  down         going       largely      number     rather      then       what
are       downed       good        last         numbers    really      there      when
area      downing      goods       later        O          right       therefore  where
areas     downs        got         latest       of         room        these      whether
around    during       great       least        off        rooms       they       which
as        E            greater     less         often      S           thing      while
ask       each         greatest    let          old        said        things     who
at        early        group       lets         older      same        think      whole
away      either       grouping    like         oldest     saw         thinks     whose
B         end          groups      likely       on         say         this       why
back      ended        H           long         once       says        those      will
backed    ending       had         longer       one        second      though     with
backing   ends         has         longest      only       seconds     thought    within
backs     enough       have        M            open       see         thoughts   without
be        even         having      made         opened     sees        three      work
because   evenly       he          make         opening    seem        through    worked
become    ever         her         making       opens      seemed      thus       working
becomes   every        herself     man          or         seeming     to         works
became    everybody    here        many         order      seems       today      would
been      everyone     high        me           ordered    several     together   Y
before    everything   higher      member       ordering   shall       too        year
began     everywhere   highest     members      orders     she         took       years
behind    F            him         men          other      should      toward     yet
being     face         himself     might        others     show        turn       you
beings    faces        his         more         our        showed      turned     young
best      fact         how         most         out        showing     turning    younger
better    facts        however     mostly       over       shows       turns      youngest
between   far          I           mr           P          side        two        your
big       felt         if          mrs          part       sides       U          yours

 

 

© 2008 The HW Wilson Company®, 950 University Avenue, Bronx, New York 10452. 800-367-6770 / 718-588-8400