Title: BCS Powerpoint template white
1Information Search & Retrieval: Problems,
solutions, trends
Tony Rose, PhD MBCS CEng
Vice-Chair, BCS IRSG
2Contents
- The BCS Information Retrieval SG
- What is IR anyway?
- How search engines work
- Why search is hard
- Where's it all going?
3Information Retrieval SG
- Growing rapidly
- 750 members
- Annual conference (ECIR)
- FDIA
- Various 1-day events
- Search Solutions
- Informer
- Discounts for various events, e.g. SIGIR
- It is free to join!
4Information Retrieval SG
- Traditional focus on search (text retrieval)
- Knowledge management, Multimedia retrieval, User
experience, Information visualisation,
extraction, summarisation, etc.
- Latest issue of Informer
- Searching for the Music You Like
- Exploring Maps through Geo-referenced Images and
RDF Shared Metadata
- Using Semantic Relations to improve Question
Answering
- Modeling Annotation of Dance Media Semantics
5What is IR?
- Science of searching for
- information in documents
- documents themselves
- metadata which describe documents,
- within databases
- whether relational stand-alone databases or
hypertextually-networked databases such as the
World Wide Web
6The Need for IR
- In a word: Infoglut
- 800 MB of recorded information is produced per
person per year (Computing magazine)
- Up to 80% of corporate information is
unstructured
- Documents, emails, images, voicemail, etc.
- So can't we just use Google?
7How do Search Engines Work?
- On the surface
- Understand what the user wants
- Find documents about that topic
- In reality
- Count words
- Apply a simple equation
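The "count words, apply a simple equation" description can be made concrete with a minimal sketch. It assumes TF-IDF as the equation, which the slides do not specify; the toy documents and the helper names (`tokenize`, `idf`, `score`, `search`) are hypothetical.

```python
import math
from collections import Counter

# Toy collection; the contents are invented for illustration.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock markets fell sharply",
}

def tokenize(text):
    """Lower-case and split on whitespace (no stemming, no stop list)."""
    return text.lower().split()

# Step 1: count the words in every document.
TERM_COUNTS = {doc_id: Counter(tokenize(text)) for doc_id, text in DOCS.items()}

def idf(term):
    """Inverse document frequency: rarer terms get a higher weight."""
    df = sum(1 for counts in TERM_COUNTS.values() if term in counts)
    return math.log(len(DOCS) / df) if df else 0.0

def score(query, doc_id):
    """Step 2: the 'simple equation', assumed here to be sum of tf * idf."""
    counts = TERM_COUNTS[doc_id]
    return sum(counts[t] * idf(t) for t in tokenize(query))

def search(query):
    """Return ids of documents with a non-zero score, best first."""
    scored = [(score(query, d), d) for d in DOCS]
    return [d for s, d in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
```

Calling `search("dog cat")` ranks `d2` ahead of `d1`, because "dog" is rarer in the collection than "cat" and so carries more weight.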
8How do Search Engines Work?
- Measure the conceptual distance between your
query and each document in the DB
- Return the best matches
Source: Maristella Agosti, University of Padova
9The Central Problem in IR
[Diagram: the Information Seeker's Concepts are expressed as Query Terms; the Author's Concepts are expressed as Document Terms]
Do these represent the same concepts?
Source: Jimmy Lin, University of Maryland
10The Central Problem in IR
- How do you represent the concepts?
- Documents and queries: bag of words
- Unordered set of terms with numeric weights
- How do you calculate similarity?
- Set theory (e.g. Boolean)
- Algebraic (e.g. vector space)
- Probabilistic
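Two of the similarity families listed above can be sketched in a few lines. This is an illustrative sketch only: it uses raw term counts as the numeric weights, the function names are hypothetical, and the probabilistic family is left out.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map each term to its count; word order is discarded."""
    return Counter(text.lower().split())

def boolean_and(query, doc):
    """Set-theoretic match: every query term must occur in the document."""
    return set(bag_of_words(query)) <= set(bag_of_words(doc))

def cosine(query, doc):
    """Vector-space similarity: cosine of the angle between count vectors."""
    q, d = bag_of_words(query), bag_of_words(doc)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

The Boolean test gives a yes/no answer, whereas the cosine gives a graded score between 0 and 1 that can be used to rank documents.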
11IR models
[Figure: overview of IR models]
Source: Wikipedia
12How do we Evaluate Search?
- Assume that results are either relevant or
non-relevant
- Precision
- Proportion of retrieved documents that are
relevant
- Recall
- Proportion of known-relevant documents that were
actually retrieved
- But what about indexing / retrieval speed, query
language, user experience, etc?
[Figure: overlapping sets of relevant and retrieved documents]
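The two definitions above translate directly into set arithmetic. A minimal sketch, assuming binary relevance judgements as the slide does; the function name and document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query under binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, five known to be relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d5", "d6", "d7"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).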
13Why Search is Hard
- Document representation
- Keywords are not enough
- Blind Venetian vs. Venetian Blind
- Terms are not independent
- Structural and discourse dependencies,
co-references, etc.
- Imperfect stop lists
- the, and, of
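Both representation problems on this slide can be reproduced with a toy indexer. This is a sketch under stated assumptions: the stop list is hypothetical (it extends the slide's "the, and, of" with a few more common words), and `index_terms` is an invented helper, not any particular engine's behaviour.

```python
# Hypothetical stop list; real systems use longer, tuned lists.
STOP_WORDS = {"the", "and", "of", "to", "be", "or", "not", "a", "in"}

def index_terms(text, stop_words=STOP_WORDS):
    """Return the sorted set of terms left after stop-word removal."""
    return sorted({t for t in text.lower().split() if t not in stop_words})

# Word order is lost, so these two phrases index identically.
same = index_terms("Blind Venetian") == index_terms("Venetian blind")

# An imperfect stop list can erase a meaningful query entirely.
hamlet = index_terms("To be or not to be")
```

With this stop list, `same` is `True` and `hamlet` is an empty list, so a keyword index built this way cannot answer either query as the user intended.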
14Why Search is Hard
- Morphological relationships
- Computer, computing, compute, computed
- Index documents using word stems
- False positives
- organization, organ → organ
- police, policy → polic
- arm, army → arm
- False negatives
- cylinder, cylindrical
- create, creation
- Europe, European
- Prefixes are particularly difficult
- Un-, dis-
- Delegate → de-leg-ate
- Ratify → rat-ify
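The false positives and false negatives above can be reproduced with a deliberately naive suffix stripper. This is not the Porter algorithm or any production stemmer; the suffix list and the `naive_stem` helper are hypothetical and chosen only to illustrate the failure modes.

```python
# Hypothetical suffix list, ordered longest first.
SUFFIXES = ("ization", "ing", "ed", "er", "e", "y")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Intended behaviour: related forms share one stem.
forms = {naive_stem(w) for w in ["computer", "computing", "compute", "computed"]}

# False positive: unrelated words collapse to the same stem.
clash = naive_stem("police") == naive_stem("policy")

# False negative: related words keep different stems.
miss = naive_stem("create") != naive_stem("creation")
```

All four forms of "compute" reduce to the single stem `comput`, while "police" and "policy" are wrongly merged and "create" and "creation" are wrongly kept apart.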
15Why Search is Hard
- Named entity recognition
- Companies in New York
- New companies in York
- NEs are highly discriminatory
- People
- Places
- Organisations
- Many vertical applications
- e.g. bioscience
16Why Search is Hard
- Semantic relationships
- Car = automobile
- Buy = purchase
- Sick = ill
- Synonym rings
- Car, automobile, truck, bus, taxi...
- Appropriate level of abstraction depends on user
task
- Development of subject-specific taxonomies
- concept matching
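One common use of synonym rings is query expansion, sketched below. The rings are taken from the examples on this slide; the `expand_query` helper and the OR-group output format are assumptions made for illustration.

```python
# Synonym rings from the slide; each set is treated as interchangeable.
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]

def expand_query(query, rings=SYNONYM_RINGS):
    """Replace each query term with the sorted members of its ring."""
    expanded = []
    for term in query.lower().split():
        # A term outside every ring stands alone.
        ring = next((r for r in rings if term in r), {term})
        expanded.append(sorted(ring))
    return expanded

groups = expand_query("buy car")
```

Each inner list in `groups` would be OR-ed together at search time. Widening a ring (for example to "car, automobile, truck, bus, taxi") retrieves more documents but matches less precisely, which is why the right level of abstraction depends on the user's task.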
17Why Search is Hard
- Word sense disambiguation
- Bank
- Financial institution?
- Part of a river?
- An aerial manoeuvre?
- Active research area
- Categorisation and clustering of results
18Google's Insight
- Exploit the link structure inherent in the web
- Calculate a measure of a document's value
- Independent of any query
- PageRank
- Overall relevance based on 100 parameters
- Constant battle with SEOs
- Enterprise search is a different proposition
- As is desktop search
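The idea of a query-independent value derived from link structure can be sketched with a textbook power-iteration form of PageRank. This is an illustrative variant, not Google's production system: the damping factor of 0.85 is the commonly cited value, and the function name, iteration count and three-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split equally over its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b both link to c, c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this graph `c` ends up with the highest rank because two pages link to it, and `b` the lowest because nothing links to it. No query is involved at any point, which is what makes the score reusable across searches.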
19Where's it all going?
- Vertical search
- Jobs, travel, health, people, etc.
- Rich media search
- Audio, video, TV, images
- Specialised content search
- blogs, news, classifieds
- Social search
- Personalisation
20Where's it all going?
- Mobile search
- Answer engines
- Active research community in Question Answering
- Multi / cross-lingual search
- Search agents
- Human UI
21Further Information
- www.irsg.bcs.org
- Informer
- ECIR (March 2008, Glasgow)
- Search Solutions 2008 (Sept 2008, London)