Title: Towards a Reference Corpus of Web Genres
1 Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
Georg Rehm1, Marina Santini2, Alexander Mehler3, Pavel Braslavski4, Rüdiger Gleim3, Andrea Stubbe5, Svetlana Symonenko6, Mirko Tavosanis7, Vedrana Vidulin8
1 University of Tübingen, SFB 441 Linguistic Data Structures, Germany
2 KTH-Stockholm University, DSV, Sweden
3 University of Bielefeld, Computational Linguistics Dept., Germany
4 Inst. of Engineering Science, RAS, Ekaterinburg, Russia
5 conject AG, Munich, Germany
6 Nitol, LLC, Moscow, Russia
7 Università di Pisa, Dipartimento di Studi italianistici, Italy
8 Jožef Stefan Institute, Ljubljana, Slovenia
Corresponding author: georg.rehm_at_uni-tuebingen.de
Language Resources and Evaluation Conference
LREC 2008
2 Introduction
- Genres are specific types of text.
- Genres have, roughly speaking, three characteristic properties:
  - Content: topic
  - Form: layout, design, text structure etc.
  - Function: communicative purpose etc.
- Genres are socially specified sets of rules and conventions.
- Genres are recognised by particular discourse communities.
- Genres usually have established names.
3 Examples of Traditional Genres
Guidebook
Cookbook
Almanac
Dictionary
Textbook
Novel
4 Scope of this Talk
- There are not only hundreds (Dimter, 1981), but thousands (Adamzik, 1995) of genres:
  - Shopping list
  - Love letter
  - Flyer
  - Weather forecast
  - CV
  - PhD thesis
  - ...
- This talk is not about traditional, paper-based genres.
- This talk is about web genres.
5 Web Genres
- Studies have shown that genres also exist on the web, e.g.
  - Personal homepage
  - FAQ
  - Blog
  - Search engine
  - Encyclopedia
  - Web shop
- Web genres are more complex than traditional genres:
  - The web is a hypertext system
  - Interactive features
  - Multimedia
6 Automatic Web Genre Identification
- If we were able to identify web genres automatically, we could exploit this information in search engines. Find
  - textbook web pages that contain "language resource"
  - PhD thesis web pages that contain "RCG parsing"
- About 20 different approaches have been published in this area (incl. the identification of traditional genres). They mainly use
  - Machine learning methods
  - Hand-crafted genre detection rules
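The search-engine use case above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Page` record, the `search` function, the example.org URLs and the sample texts are all hypothetical, and the genre label of each page is assumed to be known already.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str    # hypothetical URL, for illustration only
    genre: str  # genre label, assumed to be assigned beforehand
    text: str   # plain text content of the page


def search(pages: List[Page], phrase: str, genre: str) -> List[Page]:
    """Return pages of the given genre whose text contains the phrase."""
    needle = phrase.lower()
    return [p for p in pages if p.genre == genre and needle in p.text.lower()]


# Invented sample index; none of these pages come from the talk.
INDEX = [
    Page("http://example.org/a", "Textbook", "Chapter 3: building a language resource"),
    Page("http://example.org/b", "Thesis", "This thesis investigates RCG parsing."),
    Page("http://example.org/c", "Blog", "My thoughts on RCG parsing this week"),
]

thesis_hits = search(INDEX, "RCG parsing", "Thesis")
textbook_hits = search(INDEX, "language resource", "Textbook")
```

The genre acts as a filter on top of ordinary keyword matching, so the blog post that mentions the same phrase is excluded from the thesis query.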
7 Automatic Web Genre Identification
- All approaches have some characteristics in common.
- Nearly every group of researchers
  - have their own personal definition of web genre,
  - create their own document collection,
  - create their own set of web genre labels,
  - annotate their corpora with these web genre labels.
Diagram: a web genre identification approach consists of a classification algorithm, a corpus (collection of web documents) and a tag set (genre categories). Each of the three is DIY.
8 Automatic Web Genre Identification
Diagram: Approaches 1 to 5, each with its own algorithm (Algorithms 1 to 5), its own corpus (Corpora 1 to 5) and its own tag set (Tag sets 1 to 5).
It's impossible to compare such isolated approaches.
9 Towards a Reference Corpus of Web Genres
Diagram: Approaches 1 to 5, each with its own algorithm (Algorithms 1 to 5), share a single Reference Corpus of Web Genres.
A Reference Corpus of Web Genres enables comparative evaluation.
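What comparative evaluation on a shared corpus could look like is sketched here. Everything in it is an assumption made for illustration: the toy corpus, the two rule-based classifiers and the choice of plain accuracy as the metric are invented and do not come from any of the published approaches.

```python
from typing import Callable, Dict, List, Tuple

Corpus = List[Tuple[str, str]]     # (document text, gold genre label)
Classifier = Callable[[str], str]  # maps document text to a genre label


def accuracy(classifier: Classifier, corpus: Corpus) -> float:
    """Share of documents whose predicted label equals the gold label."""
    if not corpus:
        return 0.0
    correct = sum(classifier(text) == gold for text, gold in corpus)
    return correct / len(corpus)


def compare(classifiers: Dict[str, Classifier], corpus: Corpus) -> Dict[str, float]:
    """Score every classifier on the same corpus with the same tag set."""
    return {name: accuracy(clf, corpus) for name, clf in classifiers.items()}


# Invented toy reference corpus with gold labels.
REFERENCE_CORPUS: Corpus = [
    ("Q: How do I reset my password? A: Use the reset link.", "FAQ"),
    ("Posted on Monday: today I went hiking.", "Blog"),
    ("Add to cart. Price: 10 EUR. Checkout.", "Shop"),
    ("Q: Where is my order? A: Check your account.", "FAQ"),
]


def approach_1(text: str) -> str:
    # Hypothetical hand-crafted rules.
    if text.startswith("Q:"):
        return "FAQ"
    if "cart" in text.lower():
        return "Shop"
    return "Blog"


def approach_2(text: str) -> str:
    # Hypothetical, cruder rule: any question mark means FAQ.
    return "FAQ" if "?" in text else "Blog"


scores = compare({"Approach 1": approach_1, "Approach 2": approach_2}, REFERENCE_CORPUS)
```

Because both approaches are scored against identical documents and identical gold labels, the two numbers in `scores` are directly comparable, which is not the case when each approach brings its own corpus and tag set.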
10 Towards a Reference Corpus of Web Genres
Diagram: Approaches 1 to 5 (Algorithms 1 to 5) build on three shared components:
- Reference collection of web documents
- Shared genre category set or sets
- Annotation tool
12 Assigning Genre Labels to Web Pages
- The construction of a genre corpus involves the task of assigning genre labels to web documents by a group of annotators.
- Previous studies have shown that this is a very hard task.
Diagram: annotators tag each web document with a genre category from the set of genre categories.
13 Preliminary Study
- We conducted a survey amongst the group of authors.
- Goal: to measure the agreement of genre labels assigned to a random sample of 50 web documents by persons who are engaged in genre-related research.
- Seven of the nine authors participated.
- Result: the categories assigned by the participants contain a very high number of disparate terms at various levels of abstraction.
- Conclusion: the task of assigning genre labels to web documents is very hard, even for linguists who work on genres.
14 Assigning Genre Labels to Web Pages
- Consistency: High
  - Participant 1: News article
  - Participant 2: Article/commentary
  - Participant 3: Article
  - Participant 4: Feature
  - Participant 5: A newsletter article
  - Participant 6: News article
  - Participant 7: Journalistic
15 Assigning Genre Labels to Web Pages
- Consistency: Low
  - P1: Entry page of the website of a research journal
  - P2: Table of contents with snippets
  - P3: Portal, link collection
  - P4: Bibliography/List of Articles
  - P5: A homepage of a subscription-based academic journal
  - P6: Homepage
  - P7: Index, Content Delivery
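One simple way to put a number on the agreement in the two examples above is exact-match pairwise agreement. This is a sketch only: the slides do not say how agreement was measured in the survey, so the metric, the `pairwise_agreement` function and the lower-casing step are assumptions. The labels are copied from the two slides.

```python
from itertools import combinations
from typing import List


def pairwise_agreement(labels: List[str]) -> float:
    """Fraction of annotator pairs that assigned exactly the same label."""
    normalised = [label.strip().lower() for label in labels]
    pairs = list(combinations(normalised, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Labels from the "Consistency: High" example.
high = [
    "News article", "Article/commentary", "Article", "Feature",
    "A newsletter article", "News article", "Journalistic",
]

# Labels from the "Consistency: Low" example.
low = [
    "Entry page of the website of a research journal",
    "Table of contents with snippets",
    "Portal, link collection",
    "Bibliography/List of Articles",
    "A homepage of a subscription-based academic journal",
    "Homepage",
    "Index, Content Delivery",
]

high_score = pairwise_agreement(high)
low_score = pairwise_agreement(low)
```

With free-text labels, only one of the 21 annotator pairs matches exactly even in the "high consistency" case, which illustrates why a shared, closed category set is a precondition for measuring agreement in a meaningful way.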
16 Genre Category Sets in Previous Approaches
- Almost all category sets used in previous approaches
  - are limited in size and scope and
  - contain categories that cannot be considered genres.
Lim et al. (2005): Personal homepages, Public homepages, Commercial homepages, Bulletin collections, Link collections, Image collections, Simple tables/lists, Input pages, Journalistic materials, Research reports, Official materials, Informative materials, FAQs, Discussions, Product specifications, Others
Vidulin et al. (2007): Blog, Children's, Commercial/Promotional, Community, Content Delivery, Entertainment, Error Message, FAQ, Gateway, Index, Informative, Journalistic, Official, Personal, Poetry, Scientific, Shopping, User Input
17 Shared Genre Category Sets
- A set of genre categories is needed so that we can assign web genre labels to web documents.
- Requirements for this shared category set:
  - It should be precise, scalable, as unambiguous as possible, and reflect the genre reality as it presents itself on the web.
  - The majority of researchers in this field should agree upon the category set or sets.
- We used a wiki to come up with an initial proposal of 78 web genre categories.
18 Our Proposal for a Shared Genre Category Set
1. About Page
2. Abstract
3. Agenda (Schedule, Calendar)
4. Announcement
5. Application
6. Bibliography
7. Biography
8. Chronicle
9. Code Listings
10. Column / Editorial / Lead Article
11. Comic
12. Contact Form
13. Contract / Disclaimer / Terms and Conditions
14. Corporate Blog
15. Curriculum Vitae / CV / Resume
16. Data / Statistics / Data Sheet
17. Diary, Blog
18. Dictionary
19. Directory of Persons or Organisations
20. Discussion Group / Newsgroup
21. Download
22. Drama / Play
23. Encyclopedia
24. Errata
25. Error Message / Empty Page / Under Construction Page
26. Essay
27. Exercises (Problems)
28. FAQ
29. Feature Story / News Reportage
30. Game (Quiz, Puzzle)
31. Glossary
32. Guestbook
33. Homepage / Front Page / Entry Page
34. Horoscope
35. Index
36. Instruction
37. Interview
38. Invitation
39. Job Listing
40. Joke
41. Law / Regulation / Rule / Proclamation
42. Letter / Mail / E-Mail
43. Letter to the Editor
44. Linkfarm
45. Link Collection / Hotlist
46. List of Products
47. List of Projects
48. Login Page
49. Media (Images, videos, music, sound)
50. Meeting minutes
51. News Article
52. News Collection / Newsletter / Digest
53. Obituary
54. Official Report
55. Ordering Form / Booking Form
56. Pamphlet
57. Petition
58. Promotional / Advertisement
59. Poem / Poetry / Lyrics
60. Pornographic
61. Prose Fiction
62. Quotation
63. Reportage
64. Research Report
65. Review (Testimonial)
66. Script (Manuscript)
67. Search Form
68. Sermon
69. Shop
70. Specification
71. Speech
72. Splash Page / Gateway / Welcome Page
73. Strategic Plans
74. Survey
75. Table of contents / Sitemap / Navigation
76. Thesis
77. Travel Guide
78. Tutorial
19 Tagging HTML Documents with Genre Categories
1) Tag HTML documents: the most common approach
21 Reference Collection of Web Documents
- We plan to build the reference corpus in two stages:
  - First, we will apply our shared set of genre categories to existing collections as a proof of concept.
  - Initial step towards an objective evaluation and integrative compatibility of individual approaches.
  - Second, we will use a crawler to gather more recent as well as more diverse sets of documents.
22 Reference Collection of Web Genres (Selection)
- Web Corpus for English (Santini, 2007): editorial, biography, do-it-yourself guide, feature article (20 web pages each).
- German corpus (Mehler et al., 2007, 2008): conference website (50 sites), personal academic homepage (68 sites), project website (52 sites), city website (180 sites).
- Hierarchical Web Genre Collection (Stubbe and Ringlstetter, 2007): 32 genre classes, 40 HTML files/class, English.
- Corpus of 400 blog posts, Italian (Tavosanis, 2007).
- English (65,177 pages) and Russian (29,650 pages) corpora (Sharoff, 2007).
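Applying the shared category set to an existing collection amounts to relabelling its documents, which can be sketched as a lookup table. This is a sketch under loud assumptions: the correspondences in `LEGACY_TO_SHARED` are illustrative guesses, not a mapping agreed by the authors, and the document IDs and the `relabel` function are hypothetical. The legacy labels are the ones listed above for the Santini (2007) corpus.

```python
from typing import Dict, List, Tuple

# A few of the 78 proposed shared categories.
SHARED_CATEGORIES = {
    "Biography",
    "Column / Editorial / Lead Article",
    "Feature Story / News Reportage",
    "Instruction",
}

# ASSUMPTION: illustrative correspondences only, not an agreed mapping.
LEGACY_TO_SHARED: Dict[str, str] = {
    "editorial": "Column / Editorial / Lead Article",
    "biography": "Biography",
    "do-it-yourself guide": "Instruction",
    "feature article": "Feature Story / News Reportage",
}


def relabel(documents: List[Tuple[str, str]]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Map legacy labels to shared categories; collect IDs that need manual review."""
    mapped: List[Tuple[str, str]] = []
    unmapped: List[str] = []
    for doc_id, legacy_label in documents:
        shared = LEGACY_TO_SHARED.get(legacy_label.strip().lower())
        if shared in SHARED_CATEGORIES:
            mapped.append((doc_id, shared))
        else:
            unmapped.append(doc_id)
    return mapped, unmapped


# Hypothetical document IDs with labels from two of the existing corpora.
sample = [
    ("doc-001", "Editorial"),
    ("doc-002", "biography"),
    ("doc-003", "city website"),
]
mapped, unmapped = relabel(sample)
```

Labels without a counterpart, such as "city website" here, end up in `unmapped` instead of being forced into a category, so a human annotator can decide how to handle them.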
24 Corpus Management and Annotation Tools
- Construction of the reference corpus requires tools that support
  - compiling a document collection and
  - annotating HTML documents.
- We use the HyGraph toolbox:
  - Supports researchers in the process of corpus compilation, annotation and analysis
  - Annotate at various levels
  - Assign confidence values
  - Support for multiple tag sets and category systems
  - Uses stand-off annotation
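The idea of stand-off annotation with confidence values and multiple tag sets can be illustrated with a small data structure. This is not HyGraph's actual data model or API, which the slides do not describe. The `Annotation` fields, the tag set names `shared-78` and `lim-2005`, and the sample document are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    doc_id: str        # which document the annotation refers to
    start: int         # character offset where the annotated span begins
    end: int           # character offset where the span ends (exclusive)
    tag_set: str       # name of the category system the label belongs to
    label: str         # genre label
    confidence: float  # annotator's confidence between 0.0 and 1.0
    annotator: str


# The source document is stored separately and never modified.
html = "<html><body><h1>FAQ</h1><p>Q: What is a web genre?</p></body></html>"
DOCUMENTS: Dict[str, str] = {"doc-001": html}

seg_start = html.index("<p>")
seg_end = html.index("</p>") + len("</p>")

ANNOTATIONS: List[Annotation] = [
    # Two document-level labels from two different tag sets.
    Annotation("doc-001", 0, len(html), "shared-78", "FAQ", 0.9, "annotator-1"),
    Annotation("doc-001", 0, len(html), "lim-2005", "FAQs", 0.8, "annotator-1"),
    # One segment-level label for a single paragraph.
    Annotation("doc-001", seg_start, seg_end, "shared-78", "FAQ", 0.6, "annotator-2"),
]


def labels_for(doc_id: str, tag_set: str, min_confidence: float = 0.0) -> List[Annotation]:
    """Select the annotations of one document in one tag set above a confidence threshold."""
    return [
        a for a in ANNOTATIONS
        if a.doc_id == doc_id and a.tag_set == tag_set and a.confidence >= min_confidence
    ]


def covered_text(annotation: Annotation) -> str:
    """Recover the annotated span from the untouched source document."""
    return DOCUMENTS[annotation.doc_id][annotation.start:annotation.end]
```

Because annotations only point into the document by offset, several annotators and several category systems can label the same page at different levels without altering the HTML.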
25 Towards a Reference Corpus of Web Genres
Diagram: the reference collection of web documents, the shared genre category set or sets, and the annotation tool together form the Reference Corpus of Web Genres.
26 Summary and Future Work
- We are constructing a reference corpus of web genres.
  - Provide a shared resource for researchers who work on web genre identification and the evaluation of these systems.
- Future work includes the further realisation of this resource:
  - Apply a set of genre categories to existing corpora.
  - Collect a large set of new documents that will be categorised based on annotation guidelines using HyGraph.
  - Assign genre labels to single web documents first and to page segments as well as complete websites later.
27 Q/A
- Thanks for your attention!
- Please get in touch if you (plan to) work in the field of automatic web genre identification or a related area:
  - georg.rehm_at_uni-tuebingen.de
  - http://129.70.40.20/WebGenreWiki/
- A mailing list will be available soon.