Conference Tracker (project presentation) - PowerPoint PPT Presentation

Title: Conference Tracker Modules
Author: Kevin
Last modified by: Kevin
Created Date: 2/15/2006 12:04:34 AM
Document presentation format: On-screen Show

Slides: 26
Provided by: kevin538
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Conference Tracker (project presentation)
  • Andy Carlson
  • Vitor Carvalho
  • Kevin Killourhy
  • Mohit Kumar

2
Overview
  • Goal: to find and gather salient details about
    conferences and workshops:
  • Submission deadline
  • Location
  • Home page
  • and others
  • Preliminary results:
  • Succeeded in autonomously finding conferences,
    submission deadlines, locations, and homepages
  • although not without error
  • approaches ranged from bootstrapping to focused
    crawling

3
Conference Tracker Modules
With four modules, each group member worked
primarily on the design and implementation of a
particular component.
4
Bootstrapped Conference Acronym Discovery
  • Goal: find conference acronyms
  • Examples: ICML2006, IJCAI01, SIGMOD98
  • Discovers patterns of the form "token token _____
    token token" that frequently have acronyms in the
    blank
  • Redundant features: web page text, morphology

5
Seed Conferences
  • We start by searching for:
  • "academic conferences including"
  • "academic conferences such as"
  • "and other academic conferences"
  • "or other academic conferences"
  • This yields seeds:
  • SC2001, WWW2003

6
Finding patterns
  • Searching for SC2001 and WWW2003 yields these
    ten most frequent patterns:
  • QUESTIONS ABOUT ___ MAY BE
  • PAPERS AT ___ IN DENVER
  • GATHER AT ___ TO DEFINE,
  • TRIP TO ___ PC MEETING
  • PREVIOUS MESSAGE ___ BEOWULF PARTY
  • FWD FW ___ CALL FOR
  • TO OFFER ___ CLUSTER TUTORIAL
  • FWD AGENTS ___ WORKSHOP ON
  • EXHIBIT AT ___ TO FEATURE
  • 1 0 ___ 1 1
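
A minimal sketch of how such context patterns could be counted, assuming the search results are already available as plain-text snippets. The function name `extract_patterns`, the tokenizer, and the sample snippets are illustrative assumptions, not the presenters' code.

```python
import re
from collections import Counter

def extract_patterns(snippets, seeds, width=2):
    """Count 'left tokens ___ right tokens' contexts around seed acronyms."""
    counts = Counter()
    seed_set = {s.upper() for s in seeds}
    for text in snippets:
        # Crude tokenizer (an assumption): runs of letters, digits, apostrophes.
        tokens = re.findall(r"[A-Za-z0-9']+", text.upper())
        for i, tok in enumerate(tokens):
            # Only keep occurrences with a full window on both sides.
            if tok in seed_set and i >= width and i + width < len(tokens):
                left = " ".join(tokens[i - width:i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                counts[f"{left} ___ {right}"] += 1
    return counts

# Made-up snippets standing in for search-engine results.
snippets = [
    "Questions about SC2001 may be sent to the chairs.",
    "Accepted papers at SC2001 in Denver are online.",
    "Questions about WWW2003 may be directed to the committee.",
]
patterns = extract_patterns(snippets, ["SC2001", "WWW2003"])
print(patterns.most_common(2))
```

Ranking by raw frequency, as here, will also surface noise such as the "1 0 ___ 1 1" pattern listed above.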

7
Finding more acronyms
  • Searching for these new patterns yields more
    acronyms:
  • HFES2003
  • ICKM2005
  • SC2000
  • SCCG2000
  • SPLIT 2001
  • SVC05
  • WWW2002
  • WWW2004
  • WWW2005
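
One way to turn a pattern back into an extractor is to compile it into a regular expression and keep only blank-fillers that look like acronyms. The sketch below assumes a specific morphology rule (two or more capitals, an optional separator, and a 2- or 4-digit year); the slides do not define the rule, so `ACRONYM_RE` and `apply_pattern` are hypothetical.

```python
import re

# Assumed morphology: 2+ capital letters, optional separator, 2- or 4-digit year.
ACRONYM_RE = re.compile(r"^[A-Z]{2,}[-' ]?(?:\d{4}|\d{2})$")

def apply_pattern(pattern, text):
    """Return fillers of the blank in 'a b ___ c d' that look like acronyms."""
    left, right = pattern.split("___")
    regex = re.compile(
        r"\s+".join(map(re.escape, left.split()))
        # The blank: one token, optionally followed by a separate year token.
        + r"\s+(\S+(?:\s\d{2,4})?)\s+"
        + r"\s+".join(map(re.escape, right.split())),
        re.IGNORECASE,
    )
    return [m.group(1) for m in regex.finditer(text)
            if ACRONYM_RE.match(m.group(1))]

# Made-up text standing in for a retrieved page.
text = ("CFP ICKM2005 workshop on knowledge management. "
        "See the CFP SPLIT 2001 workshop page, or cfp our workshop.")
found = apply_pattern("cfp ___ workshop", text)
print(found)
```

The morphology check is what rejects fillers such as "our" in the last sentence of the sample text.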

8
Repeat
  • Repeating the process for 5 cycles yields 95
    conference acronyms:
  • AAAI-05, AAAI'05, AAAI-2000, AAAI-98, AAMAS 2002,
    AAMAS 2005, ACL 2005, ACSAC 2002, ADMKD'2005,
    AGENTS 1999, AIAS 2001, AMPT95, AMST 2002,
    APOCALYPSE 2000, AVI2004, AWESOS 2004, BABEL01,
    CASCON 1999, CASCON 2000, CHI2006, CHI 2006,
    CHI97, CHI99, CITSA 2004, COMPCON 93, CSCW2000,
    EACL06, ECOOP 2002, ECOOP 2003, ECSCW 2001,
    EDMEDIA 2001, EDMEDIA 2002, EDMEDIA 2004,
    EMBODY2, ES2002, ESANN 2002, ESANN 2004, GECCO
    2000, GWIC'94, HFES2003, HT05, HT'05, IAT99,
    ICKM2005, ICSM 2003, IFCS 2004, IJCAI-03,
    IJCAI05, IJCAI 2001, IJCAI 2005, IJCAI91,
    IJCAI95, ISCSB 2001, LICS 2001, MEMOCODE 2004,
    METRICS02, MIDDLEWARE 2003, NORDICHI 2002,
    NUFACT05, NWPER'04, NWPER'2000, OOPSLA'98,
    PARCO2003, PARLE'93, PKI04, PODC 2005, POPL'03,
    PROGRESS 2003, PRORISC 2002, PRORISC 2003,
    PRORISC 2004, PRORISC 2005, PROSODY 2002, RIAO
    94, ROMANSY 2002, SAC 2004, SAC2005, SC2000,
    SC2001, SCCG2000, SIGDOC'93, SIGGRAPH'83, SIGIR
    2001, SPIN97, SPLIT 2001, SPS 2004, SVC05,
    UML'2000, WOTUG 16, WOTUG 19, WWW2002, WWW2003,
    WWW2004, WWW2005, WWW2006
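
The alternation between pattern finding and acronym finding can be sketched as a loop. This is a toy version under stated assumptions: a small in-memory corpus replaces the web search, contexts are one token wide on each side, and the acronym morphology (`ACRONYM`) is a guess. None of the names come from the slides.

```python
import re
from collections import Counter

ACRONYM = r"[A-Z]{2,}[-']?\d{2,4}"  # assumed morphology, single token only

def find_patterns(corpus, acronyms, top_k=3):
    """Most frequent (previous word, next word) contexts of known acronyms."""
    counts = Counter()
    for line in corpus:
        toks = line.split()
        for i, tok in enumerate(toks):
            if tok in acronyms and 0 < i < len(toks) - 1:
                counts[(toks[i - 1].lower(), toks[i + 1].lower())] += 1
    return [p for p, _ in counts.most_common(top_k)]

def find_acronyms(corpus, patterns):
    """Tokens that fill a known context and pass the morphology check."""
    found = set()
    for line in corpus:
        toks = line.split()
        for i in range(1, len(toks) - 1):
            context = (toks[i - 1].lower(), toks[i + 1].lower())
            if context in patterns and re.fullmatch(ACRONYM, toks[i]):
                found.add(toks[i])
    return found

def bootstrap(corpus, seeds, cycles=5):
    """Alternate pattern and acronym discovery until nothing new appears."""
    acronyms = set(seeds)
    for _ in range(cycles):
        patterns = find_patterns(corpus, acronyms)
        new = find_acronyms(corpus, patterns) - acronyms
        if not new:
            break
        acronyms |= new
    return acronyms

# Made-up corpus standing in for search results.
corpus = [
    "cfp SC2001 workshop",
    "cfp ICML2006 workshop",
    "proceedings of ICML2006 pages 1-8",
    "proceedings of CHI99 pages 10-17",
    "proceedings of the pages",
]
result = sorted(bootstrap(corpus, {"SC2001"}))
print(result)  # ['CHI99', 'ICML2006', 'SC2001']
```

In this toy run the seed reaches CHI99 only in the second cycle, through a pattern learned from an acronym found in the first.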

9
Best patterns
  • Most productive patterns:
  • cfp ___ workshop
  • proceedings of ___ pages
  • for the ___ workshop

10
Bootstrapped Acronym Discovery: Conclusions
  • Using morphology to find only conference acronyms
    gave 100% precision but low recall (all acronyms
    discovered were conferences or workshops)
  • Bootstrapping from a generic set of queries can
    take us from 2 to 95 acronyms
  • To boost recall, we need some method of focusing
    on the best patterns

11
Name/Page Finder (Algorithm)
  • Supplied with an acronym/year (SAC04), finds the
    corresponding conference and its homepage
    (Selected Areas in Cryptography /
    http://vlsi.uwaterloo.ca/sac04)
  • Search Google for "SAC 04" and "SAC 2004" (10
    results each)
  • Extract potential conference names (using
    capitalization heuristics)
  • Score each web page and potential conference name
  • Select the highest-scoring page / name pair
  • Score each name and page based on:
  • Heuristics (e.g., acronym embedded in name, title
    contains acronym)
  • Inclusion of words distinctive to conference
    names and pages
  • Distinctive words are determined using TF-IDF
    scoring, and word counts are updated after each
    acronym.
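
A rough sketch of the name-extraction and heuristic-scoring steps, under stated assumptions: the connector word list, the equal weighting of the two heuristics, and all function names are illustrative, and the TF-IDF component is left out.

```python
import re

CONNECTORS = {"of", "on", "in", "and", "the", "for"}  # assumed connector words

def candidate_names(text):
    """Capitalization heuristic: runs of capitalized words, which may
    contain (but not end with) short lowercase connectors."""
    names, current = [], []
    # Punctuation and digits become their own tokens and break a run;
    # the trailing "" flushes the last run.
    for word in re.findall(r"[A-Za-z]+|[^\sA-Za-z]+", text) + [""]:
        if word[:1].isupper() or (current and word in CONNECTORS):
            current.append(word)
        else:
            while current and current[-1] in CONNECTORS:
                current.pop()
            if len(current) >= 2:
                names.append(" ".join(current))
            current = []
    return names

def acronym_embedded(acronym, name):
    """True if the acronym's letters occur in order among the name's capitals."""
    capitals = iter(c for c in name if c.isupper())
    return all(letter in capitals for letter in acronym.upper())

def score(acronym, name, page_title):
    """Sum of the two heuristics named on the slide, weighted equally."""
    return int(acronym_embedded(acronym, name)) + int(acronym.lower() in page_title.lower())

# Made-up page content for illustration.
title = "SAC 2004 - Selected Areas in Cryptography"
text = ("SAC 2004: the workshop on Selected Areas in Cryptography "
        "will be held at the University of Waterloo.")
names = candidate_names(text)
best = max(names, key=lambda n: score("SAC", n, title))
print(names, best)
```

Both candidates share the page-title point here, so the embedded-acronym heuristic decides between them.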

12
Name/Page Finder (Results)
  • Evaluation within Conference Tracker
  • Given output of the Acronym Finder, find
    name/homepage for all the acronym/year pairs.
  • When the homepage and name are both completely
    right, the result is labeled all-correct. If the
    name is correct but the homepage is wrong, it is
    labeled name-correct.

    all-correct    17/73   23%
    name-correct   19/73   26%
    total          36/73   49%

  • Evaluation as stand-alone component
  • Given a set of 27 manually collected acronyms for
    conferences with homepages in 2006, repeat the
    above procedure

    all-correct    19/27   70%
    name-correct    4/27   15%
    total          23/27   85%
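
Reading the last column as percentages, the figures follow from the fractions by rounding to whole numbers. A quick check (the helper names are illustrative):

```python
def percent(hits, total):
    """Share of hits as a whole-number percentage."""
    return round(100 * hits / total)

def summarize(results):
    """Per-label percentages plus a 'total' row over all labels."""
    rows = {label: percent(h, n) for label, (h, n) in results.items()}
    hits = sum(h for h, _ in results.values())
    total = next(iter(results.values()))[1]
    rows["total"] = percent(hits, total)
    return rows

# Counts taken from the two evaluations above.
pipeline = {"all-correct": (17, 73), "name-correct": (19, 73)}
standalone = {"all-correct": (19, 27), "name-correct": (4, 27)}

print(summarize(pipeline))    # {'all-correct': 23, 'name-correct': 26, 'total': 49}
print(summarize(standalone))  # {'all-correct': 70, 'name-correct': 15, 'total': 85}
```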
13
Location Finder Approach (Focused Crawling)
  • Motivation: Sergey Brin's approach for extracting
    author/book-title pairs
  • Observation: searching for <Conference Name>
    <Location> returns the conference main page or
    similar pages.
  • Pattern observation: these pages state the full
    name of the conference in close proximity to the
    conference location.
  • Generalized pattern: "Proximity", currently
    defined by a window of 200 characters.
  • Algorithm:
  • Query Google with the conference long name and year
  • Use the top URLs to look for locations in
    "Proximity" of the conference long name (currently
    using the topmost link only)
  • Use heuristics to assess whether the page
    contains the conference location or is a list of
    such conference-location pairs
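
The proximity step can be sketched as a dictionary look-up restricted to a 200-character window around each mention of the conference name. The function name, the tie-breaking by smallest character distance, and the sample page are assumptions for illustration; the list-detection heuristics are not covered.

```python
def find_location(page_text, conference_name, locations, window=200):
    """Return the known location closest to a mention of the conference
    name, looking at most `window` characters to either side."""
    text = page_text.lower()
    name = conference_name.lower()
    best = None  # (distance in characters, location)
    start = text.find(name)
    while start != -1:
        end = start + len(name)
        lo, hi = max(0, start - window), min(len(text), end + window)
        for loc in locations:
            pos = text.find(loc.lower(), lo, hi)
            while pos != -1:
                # Gap between the name and the location, 0 if they overlap.
                dist = pos - end if pos >= end else start - (pos + len(loc))
                dist = max(dist, 0)
                if best is None or dist < best[0]:
                    best = (dist, loc)
                pos = text.find(loc.lower(), pos + 1, hi)
        start = text.find(name, start + 1)
    return best[1] if best else None

# Made-up page text and a tiny location dictionary.
name = "International Conference on Machine Learning"
page = ("Welcome to the International Conference on Machine Learning "
        "(ICML 2006), to be held in Pittsburgh, Pennsylvania. "
        "Last year's meeting took place in Bonn.")
print(find_location(page, name, ["Bonn", "Pittsburgh", "Denver"]))
```

A fixed window keeps the look-up cheap, at the cost of missing locations that a page states far from the conference name.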

14
Location Finder Pros & Cons
  • Pros:
  • Quite generalised approach, because of the
    "Proximity" operator
  • Scalable approach
  • Cons:
  • Depends on the Google query results
  • Query crafting is important
  • Dependent on finding the home page or a
    similar page for the conference
  • Needs location annotators

15
Location Finder Test Results
  • 13 conferences and workshops (IEEE and ACM), using
    the full name to query Google and the top link
    for extraction
  • Correct: 7
  • Partially correct: 1
  • No result: 5
  • Reasons:
  • Annotator coverage: 1 (partially correct)
  • Name in image: 4
  • Text extraction from web page: 1

16
Location Finder Improvements
  • Use Co-training
  • Redundancy on the web is not being exploited
  • model is not probabilistic (currently using just
    top link for extraction)
  • Location annotator
  • Currently, a simple dictionary look-up (Use
    Minorthird/BBN)
  • Intelligent adaptable window

17
Submission Date
  • Task: find the paper submission deadline date
  • Google "call for papers", conferenceName,
    conferenceAcronym, year, "submission deadline" and
    similar queries
  • 2 types of processing: pages with CFP lists and
    usual conference pages.
  • Most of the time, there is no sentence structure.
  • Idea: proximity of keywords (submission,
    deadline, conference name, year, etc.)

18
Lists of CFP
19
Conference Dates Page
20
Submission Date
  • Hand-tuned Entity recognizer for dates
  • Several heuristics and regular expressions
  • No learning
  • Rank by the closest date to keywords
  • Some keywords: submission, deadline, conference
    acronym, year
  • Precision:
  • All conferences: top 1: 2%, top 3: 5.8%, event
    dates: 13.4%
  • More recent conferences (SIGIR, ICML, KDD,
    2003-2006):
  • Top 1: 50%, top 3: 75%
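
A minimal sketch of ranking dates by distance to keywords. It assumes a single date format ("Month D, YYYY") and only two keywords, whereas the slides describe several hand-tuned heuristics and regular expressions. The regex, the function name, and the sample text are illustrative.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Assumed date format: "February 27, 2006" (comma optional).
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},?\s+\d{{4}}\b")
KEYWORDS = ("submission", "deadline")

def rank_dates(text, keywords=KEYWORDS):
    """Return dates ordered by character distance to the nearest keyword."""
    lowered = text.lower()
    keyword_positions = [m.start()
                         for k in keywords
                         for m in re.finditer(re.escape(k), lowered)]
    if not keyword_positions:
        return []
    scored = []
    for m in DATE_RE.finditer(text):
        distance = min(abs(m.start() - p) for p in keyword_positions)
        scored.append((distance, m.group(0)))
    return [date for _, date in sorted(scored)]

# Made-up page text for a fictional conference.
text = ("XYZ 2006 takes place June 25, 2006. "
        "Paper submission deadline: February 27, 2006. "
        "Notification: April 21, 2006.")
ranked = rank_dates(text)
print(ranked[0])  # February 27, 2006
```

Because the ranking relies only on distance, an event date that sits close to a keyword can outrank the true deadline.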

21
Submission Date
  • Problems:
  • Main conference and workshop/tutorial dates
  • Conferences co-located
  • Same conference but previous year
  • Actual conference event dates
  • Change of deadlines
  • Hard to evaluate: we just couldn't find the deadline
    for some old conferences

22
Overall Results
  • Acronym finder:
  • 100% precision
  • Name/page finder:
  • 49% names correct
  • 23% names and URLs correct
  • (85% on vetted data)
  • Location finder:
  • 21% locations correct
  • 38% lists, 30% none
  • 11% wrong
  • Date finder:
  • 2% completely right
  • 5.8% in top 3
  • 13.4% event dates

23
Lessons Learned
  • If we really are learning, then reconsider
    earlier decisions in light of new knowledge
  • Pass 1: AAAI -> Holger Hoos and Thomas Stuetzle,
    IJCAI Workshop
  • Pass 2: AAAI -> National Conference on Artificial
    Intelligence
  • Supplement creative learning algorithms with
    simple, focused crawling
  • Don't underestimate the time it takes to build
    foundational tools before learning

24
Useful Resources
  • Perl
  • Rapid prototyping
  • Packages/extensions
  • Quick/dirty text manipulation
  • Shell scripts and Unix tools
  • grep, sed, bash, lynx ...
  • Google
  • wildcards (*) and date ranges (2003..2006)
  • cached web pages

25
What's Next?
  • Failure notifications from later components could
    propagate backward.
  • All components could be smarter about how far to
    descend Google's results (i.e., as long as they
    provide valuable info)
  • Given good name/acronym/location/date sets, we
    could look for lists.