Towards a Reference Corpus of Web Genres

1
Towards a Reference Corpus of Web Genres for the
Evaluation of Genre Identification Systems
Georg Rehm1, Marina Santini2, Alexander Mehler3, Pavel Braslavski4, Rüdiger Gleim3, Andrea Stubbe5, Svetlana Symonenko6, Mirko Tavosanis7, Vedrana Vidulin8
1 University of Tübingen, Germany, SFB 441 Linguistic Data Structures
2 DSV, Sweden, KTH-Stockholm University
3 University of Bielefeld, Germany, Computational Linguistics Dept.
4 Inst. of Engineering Science, RAS, Ekaterinburg, Russia
5 conject AG, Munich, Germany
6 Nitol, LLC, Moscow, Russia
7 Università di Pisa, Italy, Dipartimento di Studi italianistici
8 Jožef Stefan Institute, Ljubljana, Slovenia
Corresponding author: georg.rehm_at_uni-tuebingen.de
Language Resources and Evaluation Conference
LREC 2008
2
Introduction
  • Genres are specific types of text.
  • Genres have, roughly speaking, three
    characteristic properties:
  • Content: topic
  • Form: layout, design, text structure, etc.
  • Function: communicative purpose, etc.
  • Genres are socially specified sets of rules and
    conventions.
  • Genres are recognised by particular discourse
    communities.
  • Genres usually have established names.

3
Examples of Traditional Genres
Guidebook
Cookbook
Almanac
Dictionary
Textbook
Novel
4
Scope of this Talk
  • There are not only hundreds (Dimter, 1981), but
    thousands (Adamzik, 1995) of genres, e.g.
  • Shopping list
  • Love letter
  • Flyer
  • Weather forecast
  • CV
  • PhD thesis
  • This talk is not about traditional, paper-based
    genres.
  • This talk is about web genres.

5
Web Genres
  • Studies have shown that genres also exist on the
    web, e.g.
  • Personal homepage
  • FAQ
  • Blog
  • Search engine
  • Encyclopedia
  • Web shop
  • Web genres are more complex than traditional
    genres:
  • The web is a hypertext system
  • Interactive features
  • Multimedia

6
Automatic Web Genre Identification
  • If we were able to identify web genres
    automatically, we could exploit this information
    in search engines, e.g. to find
  • "textbook" web pages that contain "language
    resource"
  • "PhD thesis" web pages that contain "RCG parsing"
  • About 20 different approaches have been published
    in this area (incl. the identification of
    traditional genres). They mainly use
  • Machine learning methods
  • Hand-crafted genre detection rules
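
The search scenario on this slide can be sketched as a keyword query restricted by genre. This is only an illustration under stated assumptions: the `WebPage` record, the `search` function and the sample pages are hypothetical, and the genre labels are assumed to have been assigned beforehand by some genre identification system.

```python
from dataclasses import dataclass


@dataclass
class WebPage:
    url: str
    genre: str  # genre label, assumed to come from a genre identification system
    text: str


def search(pages, query, genre):
    """Return the pages of the given genre whose text contains the query phrase."""
    needle = query.lower()
    return [p for p in pages if p.genre == genre and needle in p.text.lower()]


# Hypothetical sample pages, for illustration only.
PAGES = [
    WebPage("http://example.org/a", "textbook", "An introduction to language resources."),
    WebPage("http://example.org/b", "blog", "My thoughts on language resources."),
    WebPage("http://example.org/c", "PhD thesis", "A study of RCG parsing."),
]

if __name__ == "__main__":
    for page in search(PAGES, "language resource", "textbook"):
        print(page.url)
```

Without the genre filter, the blog post would also match the first query; the genre label is what narrows the result to the textbook page.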

7
Automatic Web Genre Identification
  • All approaches have some characteristics in
    common.
  • Nearly every group of researchers
  • have their own personal definition of web
    genre,
  • create their own document collection,
  • create their own set of web genre labels,
  • annotate their corpora with these web genre
    labels.

[Diagram: a web genre identification approach combines a classification algorithm, a corpus (collection of web documents) and a tag set (genre categories); each of the three is labelled "DIY".]
8
Automatic Web Genre Identification
Approach 1: Algorithm 1, Corpus 1, Tag set 1
Approach 2: Algorithm 2, Corpus 2, Tag set 2
Approach 3: Algorithm 3, Corpus 3, Tag set 3
Approach 4: Algorithm 4, Corpus 4, Tag set 4
Approach 5: Algorithm 5, Corpus 5, Tag set 5
It's impossible to compare such isolated approaches.
9
Towards a Reference Corpus of Web Genres
Approach 1: Algorithm 1
Approach 2: Algorithm 2
Approach 3: Algorithm 3
Approach 4: Algorithm 4
Approach 5: Algorithm 5
A Reference Corpus of Web Genres enables comparative evaluation.
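
A minimal sketch of what comparative evaluation against a shared corpus could look like: every approach is scored on the same documents and the same gold labels. The document IDs, labels and system outputs below are invented for illustration, and plain accuracy is only one possible measure; the slides do not prescribe a metric.

```python
def accuracy(predicted, gold):
    """Fraction of gold-labelled documents whose predicted genre matches the gold label."""
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(1 for doc_id, label in gold.items() if predicted.get(doc_id) == label)
    return correct / len(gold)


def compare(system_outputs, gold):
    """Score every system against the same gold standard."""
    return {name: accuracy(pred, gold) for name, pred in system_outputs.items()}


# Hypothetical gold labels from a shared reference corpus.
GOLD = {"doc1": "FAQ", "doc2": "Blog", "doc3": "Shop", "doc4": "FAQ"}

# Hypothetical outputs of two genre identification approaches.
SYSTEM_OUTPUTS = {
    "approach_1": {"doc1": "FAQ", "doc2": "Blog", "doc3": "Shop", "doc4": "Blog"},
    "approach_2": {"doc1": "FAQ", "doc2": "Shop", "doc3": "Shop", "doc4": "Blog"},
}

if __name__ == "__main__":
    for name, score in compare(SYSTEM_OUTPUTS, GOLD).items():
        print(f"{name}: {score:.2f}")
```

Because both systems are scored on identical documents and an identical tag set, their numbers are directly comparable, which is not the case when each approach brings its own corpus and labels.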
10
Towards a Reference Corpus of Web Genres
Approach 1: Algorithm 1
Approach 2: Algorithm 2
Approach 3: Algorithm 3
Approach 4: Algorithm 4
Approach 5: Algorithm 5
Reference collection of web documents
Shared genre category set or sets
Annotation tool
12
Assigning Genre Labels to Web Pages
  • The construction of a genre corpus involves the
    task of assigning genre labels to web documents
    by a group of annotators.
  • Previous studies have shown that this is a very
    hard task.

[Diagram: annotators tag each web document with a genre category taken from the set of genre categories.]
13
Preliminary Study
  • We conducted a survey amongst the group of
    authors.
  • Goal: to measure the agreement of genre labels
    assigned to a random sample of 50 web documents
    by persons who are engaged in genre-related
    research.
  • Seven of the nine authors participated.
  • Result: the categories assigned by the
    participants contain a very high number of
    disparate terms at various levels of abstraction.
  • Conclusion: assigning genre labels to web
    documents is very hard, even for linguists who
    work on genres.

14
Assigning Genre Labels to Web Pages
  • Consistency: High
  • Participant 1: News article
  • Participant 2: Article/commentary
  • Participant 3: Article
  • Participant 4: Feature
  • Participant 5: A newsletter article
  • Participant 6: News article
  • Participant 7: Journalistic

15
Assigning Genre Labels to Web Pages
  • Consistency: Low
  • P1: Entry page of the website of a research
    journal
  • P2: Table of contents with snippets
  • P3: Portal, link collection
  • P4: Bibliography/List of Articles
  • P5: A homepage of a subscription-based academic
    journal
  • P6: Homepage
  • P7: Index, Content Delivery
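
To make the two examples concrete, the sketch below computes average pairwise exact-match agreement over the seven labels from each slide. The slides do not say how agreement was measured in the survey, so this measure and the `normalise` helper are assumptions chosen for simplicity.

```python
from itertools import combinations


def normalise(label):
    """Lower-case and trim a free-text label before comparing it."""
    return label.strip().lower()


def pairwise_agreement(labels):
    """Share of annotator pairs that assigned exactly the same (normalised) label."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two labels")
    agreeing = sum(1 for a, b in pairs if normalise(a) == normalise(b))
    return agreeing / len(pairs)


# Labels taken from the "high consistency" example.
HIGH = [
    "News article", "Article/commentary", "Article", "Feature",
    "A newsletter article", "News article", "Journalistic",
]

# Labels taken from the "low consistency" example.
LOW = [
    "Entry page of the website of a research journal",
    "Table of contents with snippets",
    "Portal, link collection",
    "Bibliography/List of Articles",
    "A homepage of a subscription-based academic journal",
    "Homepage",
    "Index, Content Delivery",
]

if __name__ == "__main__":
    print(f"high: {pairwise_agreement(HIGH):.3f}")
    print(f"low:  {pairwise_agreement(LOW):.3f}")
```

Under exact matching, only one of the 21 annotator pairs agrees even in the "high consistency" case (participants 1 and 6), and none agree in the "low" case. Free-text labels that a human reader sees as near-synonyms still count as disagreement, which is one reason a shared category set is needed.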

16
Genre Category Sets in Previous Approaches
  • Almost all category sets used in previous
    approaches
  • are limited in size and scope and
  • contain categories that cannot be considered
    genres.

Lim et al. (2005): Personal homepages, Public homepages, Commercial homepages, Bulletin collections, Link collections, Image collections, Simple tables/lists, Input pages, Journalistic materials, Research reports, Official materials, Informative materials, FAQs, Discussions, Product specifications, Others
Vidulin et al. (2007): Blog, Children's, Commercial/Promotional, Community, Content Delivery, Entertainment, Error Message, FAQ, Gateway, Index, Informative, Journalistic, Official, Personal, Poetry, Scientific, Shopping, User Input
17
Shared Genre Category Sets
  • A set of genre categories is needed so that we
    can assign web genre labels to web documents.
  • Requirements for this shared category set:
  • It should be precise, scalable, as unambiguous as
    possible, and reflect the genre reality as it
    presents itself on the web.
  • The majority of researchers in this field should
    agree upon the category set or sets.
  • We used a wiki to come up with an initial
    proposal of 78 web genre categories.

18
Our Proposal for a Shared Genre Category Set
1. About Page
2. Abstract
3. Agenda (Schedule, Calendar)
4. Announcement
5. Application
6. Bibliography
7. Biography
8. Chronicle
9. Code Listings
10. Column / Editorial / Lead Article
11. Comic
12. Contact Form
13. Contract / Disclaimer / Terms and Conditions
14. Corporate Blog
15. Curriculum Vitae / CV / Resume
16. Data / Statistics / Data Sheet
17. Diary, Blog
18. Dictionary
19. Directory of Persons or Organisations
20. Discussion Group / Newsgroup
21. Download
22. Drama / Play
23. Encyclopedia
24. Errata
25. Error Message / Empty Page / Under Construction Page
26. Essay
27. Exercises (Problems)
28. FAQ
29. Feature Story / News Reportage
30. Game (Quiz, Puzzle)
31. Glossary
32. Guestbook
33. Homepage / Front Page / Entry Page
34. Horoscope
35. Index
36. Instruction
37. Interview
38. Invitation
39. Job Listing
40. Joke
41. Law / Regulation / Rule / Proclamation
42. Letter / Mail / E-Mail
43. Letter to the Editor
44. Linkfarm
45. Link Collection / Hotlist
46. List of Products
47. List of Projects
48. Login Page
49. Media (Images, videos, music, sound)
50. Meeting minutes
51. News Article
52. News Collection / Newsletter / Digest
53. Obituary
54. Official Report
55. Ordering Form / Booking Form
56. Pamphlet
57. Petition
58. Promotional / Advertisement
59. Poem / Poetry / Lyrics
60. Pornographic
61. Prose Fiction
62. Quotation
63. Reportage
64. Research Report
65. Review (Testimonial)
66. Script (Manuscript)
67. Search Form
68. Sermon
69. Shop
70. Specification
71. Speech
72. Splash Page / Gateway / Welcome Page
73. Strategic Plans
74. Survey
75. Table of contents / Sitemap / Navigation
76. Thesis
77. Travel Guide
78. Tutorial
19
Tagging HTML Documents with Genre Categories
1) Tag HTML documents (the most common approach)
21
Reference Collection of Web Documents
  • We plan to build the reference corpus in two
    stages:
  • First, we will apply our shared set of genre
    categories to existing collections as a proof of
    concept.
  • This is an initial step towards an objective
    evaluation and integrative compatibility of
    individual approaches.
  • Second, we will use a crawler to gather more
    recent as well as more diverse sets of documents.
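
The first stage, applying the shared category set to an existing collection, can be sketched as a label mapping. The mapping below is purely hypothetical: it pairs four genre names from an existing English collection with entries from the proposed 78-category set for illustration and is not a mapping agreed on by the authors.

```python
# A few entries of the proposed shared category set.
SHARED_CATEGORIES = {
    "Biography",
    "Column / Editorial / Lead Article",
    "Feature Story / News Reportage",
    "Instruction",
}

# Hypothetical mapping from legacy corpus labels to shared categories.
LEGACY_TO_SHARED = {
    "editorial": "Column / Editorial / Lead Article",
    "biography": "Biography",
    "do-it-yourself guide": "Instruction",
    "feature article": "Feature Story / News Reportage",
}


def relabel(documents, mapping, shared):
    """Map legacy labels to shared categories; collect documents that cannot be mapped."""
    relabelled, unmapped = {}, []
    for doc_id, legacy_label in documents.items():
        target = mapping.get(legacy_label.strip().lower())
        if target in shared:
            relabelled[doc_id] = target
        else:
            unmapped.append(doc_id)  # needs manual annotation
    return relabelled, unmapped


if __name__ == "__main__":
    docs = {"d1": "Editorial", "d2": "biography", "d3": "city website"}
    mapped, todo = relabel(docs, LEGACY_TO_SHARED, SHARED_CATEGORIES)
    print(mapped)
    print("to annotate manually:", todo)
```

Documents whose legacy label has no counterpart in the mapping are returned separately, since they would have to be annotated by hand rather than converted automatically.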

22
Reference Collection of Web Genres (Selection)
  • Web Corpus for English (Santini, 2007):
    editorial, biography, do-it-yourself guide,
    feature article (20 web pages each).
  • German corpus (Mehler et al., 2007, 2008):
    conference website (50 sites), personal academic
    homepage (68 sites), project website (52 sites),
    city website (180 sites).
  • Hierarchical Web Genre Collection (Stubbe and
    Ringlstetter, 2007): 32 genre classes, 40 HTML
    files/class, English.
  • Corpus of 400 blog posts, Italian (Tavosanis,
    2007).
  • English (65,177 pages) and Russian (29,650 pages)
    corpora (Sharoff, 2007).

24
Corpus Management and Annotation Tools
  • Construction of the reference corpus requires
    tools that support
  • compiling a document collection and
  • annotating HTML documents.
  • We use the HyGraph toolbox:
  • Supports researchers in the process of corpus
    compilation, annotation and analysis
  • Annotate at various levels
  • Assign confidence values
  • Support for multiple tag sets and category
    systems
  • Uses stand-off annotation
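
The features listed above (stand-off annotation, confidence values, multiple tag sets, several annotation levels) can be illustrated with a small record type. This is not HyGraph's actual data model, which the slides do not describe; the `Annotation` class, its field names and the tag set names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: stored separately, the HTML file itself is never modified."""
    document: str      # URL or path of the annotated HTML document
    tag_set: str       # name of the category system the label belongs to
    category: str      # genre label from that tag set
    confidence: float  # annotator's confidence, between 0.0 and 1.0
    annotator: str
    span: Optional[Tuple[int, int]] = None  # character offsets; None means the whole document

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def labels_for(annotations, document, tag_set):
    """All annotations of one document under one tag set."""
    return [a for a in annotations if a.document == document and a.tag_set == tag_set]


if __name__ == "__main__":
    store = [
        Annotation("http://example.org/help.html", "shared-78", "FAQ", 0.9, "annotator_1"),
        Annotation("http://example.org/help.html", "other-set", "Informative", 0.6, "annotator_1"),
        Annotation("http://example.org/help.html", "shared-78", "Contact Form", 0.8,
                   "annotator_2", span=(1200, 1850)),
    ]
    for a in labels_for(store, "http://example.org/help.html", "shared-78"):
        print(a.category, a.confidence, a.span)
```

Keeping annotations outside the document allows several tag sets and several annotators to coexist on the same page, and the optional `span` field covers labels for page segments as well as for whole documents.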

25
Towards a Reference Corpus of Web Genres
Reference collection of web documents
Shared genre category set or sets
Annotation tool
Reference Corpus of Web Genres
26
Summary and Future Work
  • We construct a reference corpus of web genres.
  • The aim is to provide a shared resource for
    researchers who work on web genre identification
    and the evaluation of these systems.
  • Future work includes the further realisation of
    this resource:
  • Apply a set of genre categories to existing
    corpora.
  • Collect a large set of new documents that will be
    categorised based on annotation guidelines using
    HyGraph.
  • Assign genre labels to single web documents first
    and to page segments as well as complete websites
    later.

27
Q/A
  • Thanks for your attention!
  • Please get in touch if you (plan to) work in the
    field of automatic web genre identification or a
    related area:
  • georg.rehm_at_uni-tuebingen.de
  • http://129.70.40.20/WebGenreWiki/
  • A mailing list will be available soon.