Collaborative Publishing: Wiki and Wikipedia - PowerPoint PPT Presentation

Provided by: sisPittE9

1
Collaborative Publishing: Wiki and Wikipedia
  • By Qi Li

2
Agenda
  • Overview of Wiki and Wikipedia
  • Concepts: Wiki, Wikipedia, Wikimedia Foundation
  • Software and Hardware of Wikis
  • Size of Wikipedia
  • Knowledge Organization of Wikipedia
  • Improving Wikipedia's Accuracy
  • Wikipedia in Natural Language Processing

3
Overview of Wiki and Wikipedia
  • Reference
  • Keshava P Subramanya (keshava_at_cs.ucsb.edu)
  • Roopa Kannan (roopakannan_at_cs.ucsb.edu)

4
  • What is Wiki?
  • What is Wikipedia?

5
What is Wikipedia?
  • Wikipedia is a freely licensed encyclopedia
    written by thousands of volunteers in many
    languages
  • Free license allows others to freely copy,
    redistribute, and modify our work commercially or
    non-commercially
  • Founded January 15, 2001
  • wikipedia.org

6
What is a wiki?
  • A wiki is software that allows users to create,
    edit, and link web pages easily.
  • Wikis are often used to create collaborative
    websites and to power community websites.
  • Ward Cunningham, developer of the first wiki,
    WikiWikiWeb, originally described it as "the
    simplest online database that could possibly
    work".

7
What is the Wikimedia Foundation?
  • Non-profit foundation
  • Aims to distribute a free encyclopedia to every
    single person on the planet in their own language
  • Wikipedia and its sister projects
  • Funded by public donations
  • http://wikimediafoundation.org/wiki/Donate
  • Applying for grants
  • wikimediafoundation.org

8
Wikimedia Projects
  • Wikipedia
  • Wiktionary
  • Wikibooks
  • Wikisource
  • Wikiquote
  • Wikispecies
  • Wikimedia Commons
  • Wikinews
  • .......

9
Wikimedia Foundation
  • Governed by a Board of Directors (5 positions:
    1 permanent (Jimmy Wales), 2 Bomis reps,
    2 community reps)
  • The Foundation coordinates official (volunteer)
    positions: fundraising, legal, technical
    development, press, etc.
  • MediaWiki (software)
  • And the projects: Wiktionary, Wikinews,
    Wikipedia, Wikiversity, Wikiquote, Wikisource,
    Commons
  • Local chapters: English (en), German (de),
    Italian (it), etc.; 215 languages in total
10
Advantages of Free License
  • Remains non-proprietary
  • Decreases individual sense of ownership
  • Increases a sense of shared ownership
  • Enhances the popularity of Wikipedia
  • Attribution requirement extends brand

11
Free Software
  • MediaWiki is licensed under the GNU General
    Public License (GPL)
  • The website runs entirely on free software
  • GNU/Linux
  • Apache
  • MySQL
  • PHP

12
Wikimedia's Hardware
  • 40 servers
  • Squid caching servers in front to serve cached
    objects quickly
  • Apache/PHP web servers in the middle
  • Database backend (MySQL)
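
The three tiers listed on this slide can be pictured with a small sketch. It is only an illustration of the cache-in-front idea, not Wikimedia's actual configuration: the class names `Backend` and `CachingFrontend` are hypothetical, and the `purge` step after an edit is an assumption added to make the example complete.

```python
class Backend:
    """Stands in for the Apache/PHP and MySQL tiers: renders a page on demand."""

    def __init__(self, pages):
        self.pages = pages   # title -> stored text, playing the role of the database
        self.renders = 0     # counts how often the slow path was taken

    def render(self, title):
        self.renders += 1
        return "<html>" + self.pages[title] + "</html>"


class CachingFrontend:
    """Stands in for the Squid tier: serves cached objects, else asks the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def get(self, title):
        if title not in self.cache:
            self.cache[title] = self.backend.render(title)
        return self.cache[title]

    def purge(self, title):
        # Assumed behaviour: drop the cached copy after an edit
        # so the next reader triggers a fresh render.
        self.cache.pop(title, None)


backend = Backend({"Wiki": "A wiki is software that allows users to edit pages."})
frontend = CachingFrontend(backend)
frontend.get("Wiki")
frontend.get("Wiki")          # second request is answered from the cache
print(backend.renders)
```

Most reads never reach the web servers or the database, which is why the caching tier sits in front.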

13
(No Transcript)
14
How big is Wikipedia?
  • English Wikipedia is largest and has over 130
    million words
  • English Wikipedia larger than Britannica and
    Microsoft Encarta combined
  • In 15 months the publicly distributed compressed
    database dumps may reach 1 terabyte total size
  • http://en.wikipedia.org/wiki/Wikipedia:Statistics

15
How big is Wikipedia Globally?
  • English: 533,000 articles
  • German: 220,000 articles
  • Japanese: 110,000 articles
  • French: 100,000 articles
  • Swedish: 71,000 articles
  • Nearly 1.5 million articles across 200 languages
  • 20 languages with >10,000 articles; 50 with >1,000
  • http://en.wikipedia.org/wiki/Special:SiteMatrix
  • http://Meta.wikimedia.org/wiki/Statistics
  • http://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics

16
How popular is Wikipedia?
17
  • Knowledge Organization with Wikipedia
  • Reference
  • Jakob Voss, Common Library Network (GBV), at the
    5th NKOS Workshop, Alicante, September 21, 2006
  • Phoebe Ayers, UC Davis, Physical Sciences &
    Engineering Library, phoebe.ayers _at_ gmail.com,
    en.wikipedia.org/wiki/User:Phoebe Ayers

Jakob Voss: Knowledge Organization with
Wikipedia. 5th NKOS Workshop, Sep 21, 2006
18
Scenario 1
  • Question
  • I know there was an American President named Bush
  • Task
  • Exact name?
  • Other information
  • How to search on Wikipedia?
  • How to search on Google?

19
Scenario 2
  • Title: Pacific navigators Australia explorers
  • Entities
  • James Cook (15630)
  • La Pérouse (58090)
  • Tasman (1988)
  • Categories
  • explorers (5208)
  • Description
  • Find the navigators and explorers in the Pacific
    sea in search of Australia
  • Narrative
  • I am doing an essay on the explorers who
    discovered or charted Australia. I am already
    aware of Tasman, Cook and La Pérouse and would
    like to get the full list of navigators who
    contributed to the discovery of Australia. Those
    for whom there are disputes about their actual
    discovery of (parts of) Australia are still
    acceptable. I am mainly interested in the
    captains of the ships, but other people who were
    on board with those navigators are still relevant
    (naturalists or others). I am not interested in
    those who came later to settle in Australia.
  • Topic from INEX 2007

20
Scenario 3
  • School of Information Science
  • SIS?
  • Faculty?
  • Ronald L. Larsen
  • Peter Brusilovsky
  • ...
  • How to communicate?
  • http://en.wikipedia.org/wiki/Wikipedia:About

21
What are Wikipedia namespaces?
  • Main: The main namespace or article namespace is
    the encyclopedia proper. It is the default
    namespace and does not use a prefix.
  • Portal (prefix Portal:) is for reader-oriented
    portals that help readers find and browse through
    articles related to a specific subject.
  • User (prefix User:) is a namespace that provides
    pages for Wikipedia users' personal presentations
    and auxiliary pages for personal use, for example
    containing bookmarks to favorite pages.
  • Image (prefix Image:, also called image
    description pages) is a namespace that provides
    info about images and sound clips, one page for
    each, with a link to the image or sound clip
    itself.

22
Wikipedia Namespace (cont.)
  • Category: contains categories of pages, with each
    displaying a list of pages in that category and
    optional additional text.
  • Help: documents the basic, technical features of
    Wikipedia.
  • Talk: talk namespaces are used to discuss changes
    to the corresponding page in the associated
    namespace. Pages in the user talk namespace are
    used to leave messages for a particular user.
  • The talk namespace associated with the main
    article namespace has the prefix Talk:,
  • while the talk namespace associated with the user
    namespace has the prefix User talk:

23
Wikipedia Namespace (cont.)
  • MediaWiki (prefix MediaWiki:) is a namespace
    containing interface texts such as link labels
    and messages. They are used for adjusting the
    localisation (i.e. local version) of interface
    messages without waiting for a new LanguageXx.php
    file to get installed.
  • Template (formerly part of the MediaWiki
    namespace) is used to define a standard text
    which can then be conveniently added within
    pages, either the text itself at the time of
    adding, or a reference to the text at the time of
    viewing the page. The latter way effectively
    changes all such occurrences of the standard text
    automatically by just editing the page where the
    text is defined.
  • http://en.wikipedia.org/wiki/Wikipedia:Namespace
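
As a rough illustration of how a prefix selects a namespace, the sketch below splits a full page title into namespace and page name. It assumes the usual "Prefix:Title" form with a colon separator; the function name `split_title` and the fixed prefix set are hypothetical and cover only the namespaces named on these slides.

```python
# Prefixes mentioned on the namespace slides; real wikis define more.
KNOWN_PREFIXES = {
    "Portal", "User", "Image", "Category", "Help",
    "Talk", "User talk", "MediaWiki", "Template",
}


def split_title(full_title):
    """Return (namespace, page_name) for a full page title."""
    prefix, separator, rest = full_title.partition(":")
    if separator and prefix in KNOWN_PREFIXES:
        return prefix, rest
    # No recognised prefix: the page lives in the main (article) namespace.
    return "Main", full_title


print(split_title("User talk:Example"))
print(split_title("Venus (planet)"))
print(split_title("Star Wars: A New Hope"))
```

A colon inside an ordinary article title is left alone because the text before it is not a known prefix.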

24
How do articles get written?
  • Someone starts it
  • Someone else checks it
  • A (possibly third) party edits it
  • http://en.wikipedia.org/wiki/Help:Contents/Editing_Wikipedia

25
Article Criteria
  • Notable (encyclopedic)
  • Not vanity
  • Not duplication
  • Community consensus

26
(No Transcript)
27
http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page
28
(No Transcript)
29
(No Transcript)
30
Edit wars and other things that go boom
31
(No Transcript)
32
Predictable vandalism: posted and reverted the
same minute (10:31)
33
(No Transcript)
34
What is Collaborative Publishing?
  • Collaborative works are created by multiple
    people together rather than individually
  • Publishing knowledge
  • Some projects are overseen by an editor or
    editorial team
  • Many grow without any top-down oversight

35
Characteristic 1: Access control
  • All users can edit any page, subject to access
    control
  • Types of access
  • http://en.wikipedia.org/wiki/Wikipedia:Editorial_oversight_and_control#Types_of_access

36
Characteristic 2: Revision control
http://en.wikipedia.org/wiki/Wikipedia:Editorial_oversight_and_control#Types_of_access
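
Revision control is what makes quick reverts of vandalism possible. The following is a minimal sketch of the idea, assuming a page simply keeps every saved version in a list; the `Page` class and its method names are hypothetical and do not describe MediaWiki's storage model.

```python
class Page:
    """Keeps every revision so that any earlier state can be restored."""

    def __init__(self, text):
        self.history = [text]

    @property
    def current(self):
        return self.history[-1]

    def edit(self, new_text):
        self.history.append(new_text)

    def revert_to(self, index):
        # A revert is stored as a new revision, so nothing is ever lost,
        # including the vandalised version.
        self.history.append(self.history[index])


page = Page("A wiki is software for collaborative editing.")
page.edit("VANDALISM!!!")
page.revert_to(0)
print(page.current)
print(len(page.history))
```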
37
Wikipedia's Accuracy
38
Criticisms
  • Could a collaborative project that anyone can
    edit be a public good?
  • Contribute articles
  • Quality of articles is close to Encyclopaedia
    Britannica
  • Vandalism
  • Creeping bureaucracy: growing instances of
    infighting among editors
  • The community's anti-intellectual attitude
  • "Digital Maoism"
  • "Faith-based encyclopaedia"
  • http://en.wikipedia.org/wiki/Wikipedia:About

39
Further criticisms
  • Entries for pop cultural figures vs. those for
    great literary figures, scientists, etc.
  • Entry for Britney Spears longer than entry for
    St. Augustine
  • Seinfeld longer than Shakespeare; Barbie longer
    than Bellow
  • Response: Nothing to get exercised about

40
80/10 Rule
  • Counting only logged in users, and even excluding
    some prominent approved bot users
  • 10 percent of all users make 80 percent of all
    edits
  • 5 percent of all users make 66 percent of all
    edits
  • Half of all edits are made by just 2.5 percent
    of all users
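
Figures like these come from ranking users by edit count and summing from the top. A minimal sketch of that calculation follows; the function name `top_share` and the per-user counts are made up for illustration and are not Wikipedia data.

```python
import math


def top_share(edit_counts, percent):
    """Fraction of all edits made by the most active `percent` of users."""
    ranked = sorted(edit_counts, reverse=True)
    k = max(1, math.ceil(len(ranked) * percent / 100))
    return sum(ranked[:k]) / sum(ranked)


# Toy data: edits per logged-in user (ten users, 100 edits in total).
counts = [80, 5, 3, 3, 2, 2, 2, 1, 1, 1]
print(top_share(counts, 10))
print(top_share(counts, 50))
```

With this toy distribution the single most active user (10 percent of users) accounts for 80 percent of the edits.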

41
Edits by Anons
  • Controversial, intriguing
  • Yes, you can edit this page
  • Without logging in!

42
Edits by Anons
  • Anonymous IP addresses can edit Wikipedia, and do
  • But these edits make up a total of around 18
    percent of all edits, with some evidence of a
    downward trend over time
  • Anecdotally, many regular users report sometimes
    editing anonymously by accident or as a quiet
    form of sock puppetry

43
Edits across namespaces
  • Articles: 85%
  • Talk pages: 8%
  • User pages: 3%
  • User talk pages: 4%
  • These percentages were stable across 2003 and
    2004

44
Studying the Accuracy of Wikipedia
  • Study by Nature
  • Factual errors, omissions or misleading
    statements, Wikipedia vs. Britannica: 162 vs.
    123; major errors: 4 vs. 4
  • Survey: do respondents think sample articles are
    accurate?
  • 76% answered "accurate"

45
Separate the wheat from the chaff
  • Proposal 1: based on explicit article validation
  • A trusted user (defined using various criteria)
    explicitly marks an article as good
  • Peer-based explicit system: allows users to choose
    which of their peers to trust, thus providing
    different results for each user
  • Shortcoming: requires explicit input from
    reviewers

46
  • Proposal 2: automatically assess information
    quality by calculating metrics based on metadata
    recorded and stored by Wikipedia
  • Metrics: number of edits made to the article and
    number of unique editors of the article
  • Distinguishing two classes of pages
  • Link ratio analysis
  • Quality of editors
  • Trustworthiness or reputation of authors and
    articles
  • Segments instead of articles
47
  • Surprisingly successful
  • Large/Complete/Coverage
  • Again: free

48
References
  • Cohen, Noam. Courts Turn to Wikipedia, but
    Selectively. The New York Times (January 29,
    2007): Section C, page 3.
  • Economist. Battle of Britannica. Economist
    378.8471 (April 1, 2006): 65-66.
  • Fallis, Don. The Epistemic Benefits and Costs
    of Collaboration. Southern Journal of
    Philosophy 44.S (2006): 197-208.
  • Fallis, Don. On Verifying the Accuracy of
    Information: Philosophical Perspectives. Library
    Trends 52.3 (2004): 463-487.
  • Fricke, Martin and Don Fallis. Indicators of
    Accuracy of Consumer Health Information on the
    Internet. Journal of the American Medical
    Informatics Association 9 (2002): 73-79.
  • Giles, J. Internet Encyclopedias Go Head to
    Head. Nature 438.7069 (December 15, 2005):
    900-901.
  • Hettinger, Edwin. Justifying Intellectual
    Property. Philosophy and Public Affairs 18
    (1989): 31-52.
  • Paine, Lynn Sharp. Trade Secrets and the
    Justification of Intellectual Property: A Comment
    on Hettinger. Philosophy and Public Affairs 20
    (1991): 247-263.
  • Poe, Marshall. The Hive. Atlantic Monthly 298.2
    (September 2006): 86-94.
  • Resnik, David. A Pluralistic Account of
    Intellectual Property. Journal of Business
    Ethics 46 (2003): 319-335.
  • Schiff, Stacy. Know it All. New Yorker 82.23
    (July 31, 2006).
  • Sunstein, Cass. Mobbed up. New Republic 230.24
    (June 28, 2004): 40-45.
  • Surowiecki, James. The Wisdom of Crowds. New
    York: Anchor Books, 2004.

49
Wikipedia in NLP
  • Ontology
  • Thesauri
  • Categorization
  • Topic Detection
  • Information Retrieval (Query Expansion)
  • Word Sense Disambiguation
  • Question Answering
  • Translation (CLIR)

50
Wikitology!
  • Using Wikipedia as an ontology offers the best of
    both approaches
  • Each article is a concept in the ontology
  • Terms are linked via Wikipedia's category system
    and inter-article links
  • It's a consensus ontology: created, kept current
    and maintained by a diverse community
  • Overall content quality is high

51
Wikitology features
  • Terms have unique IDs (URLs) and are
    self-describing for people
  • Several underlying graphs provide structure:
    categories, article links
  • Article history contains useful metadata (e.g.,
    for trust)
  • External sources provide more info (e.g.,
    Google's PageRank)
  • Some of the data is available in structured form,
    e.g., in RDF from DBpedia

52
Semantic Wikipedia
Völkel et al. (2006). Semantic Wikipedia. WWW2006
conference
  • Typed links: [[is capital of::England]] => RDF
    triples
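
To make the step from typed links to triples concrete, here is a small sketch. It assumes the double-colon form "[[relation::target]]" for a typed link and takes the title of the article containing the link as the subject of the triple; the function name `extract_triples` and the sample sentence are illustrative only, and no RDF library is used.

```python
import re

# Assumed typed-link form: [[relation::target]]; plain [[target]] links are skipped.
TYPED_LINK = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_triples(subject, wikitext):
    """Return (subject, relation, object) triples for every typed link."""
    return [
        (subject, relation.strip(), target.strip())
        for relation, target in TYPED_LINK.findall(wikitext)
    ]


text = "London [[is capital of::England]] and lies on the [[Thames]]."
triples = extract_triples("London", text)
print(triples)
```

The untyped link to Thames yields no triple because it carries no relation name.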

53
Thesauri
  • Reference
  • Milne, D., Medelyan, O., and Witten, I. H.
    (2006). Mining Domain-Specific Thesauri from
    Wikipedia: A Case Study. Proceedings of the 2006
    IEEE/WIC/ACM International Conference on Web
    Intelligence.
  • Milne, D., Witten, I. H., and Nichols, D. M.
    (2007). Extracting corpus specific knowledge
    bases from Wikipedia. CIKM. Lisbon, Portugal.

54
Thesauri
  • Thesauri
  • an indexed compilation of words with similar,
    related, broader, narrower and opposite meanings.
  • Wikipedia
  • Each article - a concept
  • Hyperlinks - relations
  • Equivalence - USE, USE FOR
  • Hierarchical - BT, NT
  • Associative - RT

55
Topic Detection
  • Reference
  • Identifying document topics using the Wikipedia
    category network, Peter Schonhofen, Proceedings
    of the 2006 IEEE/ACM International Conference on
    Web Intelligence (WI 2006 Main Conference
    Proceedings)

56
  • Topic Detection
  • Detect concepts in the document
  • Select the most dominant concepts to represent
    the document
  • Ontology from Wikipedia
  • Coverage of Wikipedia is general purpose and very
    wide
  • Structure is rich and consistent
57
Wikipedia structure
  • Components: articles, image pages, discussion
    pages about article contents, authors, page
    component templates, and so on.
  • Articles: have titles, belong to categories, and
    refer to other articles
  • Categories: organized hierarchically into sub-
    and super-categories (not just a tree)
  • Authors: create the links between articles and
    the hierarchy of categories.

58
Wikipedia structure
59
Wikipedia for classification
  • Reference
  • Gabrilovich, Evgeniy and Shaul Markovitch
    (2006). Overcoming the Brittleness Bottleneck
    using Wikipedia: Enhancing Text Categorization
    with Encyclopedic Knowledge. American Association
    for Artificial Intelligence.
  • Banerjee, S., Ramanathan, K., and Gupta, A.
    (2007). Clustering short text using Wikipedia.
    SIGIR.
  • Meyer, M. and Rensing, C. (2007). Categorizing
    Learning Objects based on Wikipedia as Substitute
    Corpus. Proceedings of the First International
    Workshop on Learning Object Discovery and
    Exchange.

60
Text Classification
  • Deals with automatic assignment of category
    labels to natural language documents
  • Represents documents as bags of words (BOW)
  • Features come from words
  • Limitation of BOW
  • Documents are characterized only by individual
    word occurrences in the training set
  • "Wal-Mart supply chain goes real time"
  • "Wal-Mart manages its stock with RFID technology"
  • Effective for medium-difficulty categorization,
    but poor on small categories or short documents
  • Use an encyclopedia to endow the machine with the
    breadth of knowledge available to humans
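
The limitation is easy to see on the two Wal-Mart headlines from this slide. The sketch below builds a plain bag of words for each (lowercased, split on whitespace, which is an assumed tokenization) and prints the words they share.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())


first = bag_of_words("Wal-Mart supply chain goes real time")
second = bag_of_words("Wal-Mart manages its stock with RFID technology")

shared = set(first) & set(second)
print(shared)
```

Both headlines are about the same topic, yet apart from the company name they have no word in common, so a classifier that sees only word occurrences has little evidence that they belong together.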

61
  • Auxiliary text classifier
  • Matches documents with the most relevant
    articles of Wikipedia
  • Conventional bag of words + new features
  • Example of the idea of an auxiliary text
    classifier
  • "Bernanke takes charge"
  • BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE
    FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM,
  • Using Wikipedia
  • Use text similarity algorithms to automatically
    identify encyclopedia articles relevant to each
    document
  • Leverage the knowledge gained from these articles
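
A minimal sketch of this enrichment step follows. It uses cosine similarity over word counts as the text similarity measure, which is an assumption (the slides do not say which algorithm is used), and the two short article texts are invented stand-ins for real Wikipedia articles. The function names `cosine` and `enrich` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def enrich(document, articles, top_n=1):
    """Return the document's words plus the titles of its most similar articles."""
    doc = bow(document)
    ranked = sorted(articles, key=lambda t: cosine(doc, bow(articles[t])), reverse=True)
    features = set(doc)
    features.update(title.upper() for title in ranked[:top_n])
    return features


# Toy article texts, invented for this example.
articles = {
    "Ben Bernanke": "bernanke is an economist who takes charge as chairman of the federal reserve",
    "Barbie": "barbie is a fashion doll",
}
features = enrich("Bernanke takes charge", articles)
print(sorted(features))
```

The headline keeps its original words and gains the concept BEN BERNANKE as a new feature, in the spirit of the slide's example.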

62
Word Sense Disambiguation
  • Some names denote multiple entities
  • "John Williams and the Boston Pops conducted a
    summer Star Wars concert at Tanglewood."
  • John Williams → John Williams (composer)
  • "John Williams lost a Taipei death match against
    his brother, Axl Rotten."
  • John Williams → John Williams (wrestler)
  • "John Williams won a Victoria Cross for his
    actions at the battle of Rorke's Drift."
  • John Williams → John Williams (VC)
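
One common way to pick the right entity is to compare the words around the name with words describing each candidate. The sketch below does this with a plain word-overlap count; it is a simplified illustration rather than the method behind these slides, and the short candidate descriptions are invented stand-ins for article text.

```python
def disambiguate(context, candidates):
    """Pick the candidate whose description shares the most words with the context."""
    context_words = set(context.lower().split())

    def overlap(entity):
        return len(context_words & set(candidates[entity].lower().split()))

    return max(candidates, key=overlap)


# Toy descriptions, invented for this example.
candidates = {
    "John Williams (composer)": "composer conductor film scores star wars boston pops",
    "John Williams (wrestler)": "professional wrestler death match ring",
    "John Williams (VC)": "soldier victoria cross battle",
}

sentences = [
    "John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.",
    "John Williams lost a Taipei death match against his brother, Axl Rotten.",
    "John Williams won a Victoria Cross for his actions at the battle of Rorke's Drift.",
]
results = [disambiguate(sentence, candidates) for sentence in sentences]
print(results)
```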

63
  • Some entities have multiple names
  • John Williams (composer) → John Williams
  • John Williams (composer) → John Towner Williams
  • John Williams (wrestler) → John Williams
  • John Williams (wrestler) → Ian Rotten
  • Venus (planet) → Venus
  • Venus (planet) → Morning Star
  • Venus (planet) → Evening Star

64
WSD
  • Web searches
  • Queries about Named Entities (NEs) constitute a
    significant portion of popular web queries.
  • Ideally, search results are clustered such that
  • In each cluster, the queried name denotes the
    same entity.
  • Each cluster is enriched by querying the web with
    alternative names of the corresponding entity.
  • Web-based Information Extraction (IE)
  • Aggregating extractions from multiple web pages
    can lead to improved accuracy in IE tasks (e.g.
    extracting relationships between NEs).
  • Named entity disambiguation is essential for
    performing a meaningful aggregation.

65
Wikipedia Structures
  • In general, there is a many-to-many relationship
    between names and entities, captured in Wikipedia
    through:
  • Redirect articles.
  • Disambiguation articles.
  • Hyperlinks: an article may contain links to other
    articles in Wikipedia.
  • Categories: each article belongs to at least one
    Wikipedia category.

66
Redirect Articles
  • Redirect article
  • Exists for each alternative name used to refer to
    an entity in Wikipedia.
  • Example: The article titled John Towner Williams
    consists of a pointer to the article John
    Williams (composer).
  • Disambiguation article
  • Lists all Wikipedia entities (articles) that may
    be denoted by an ambiguous name.
  • Example: The article titled John Williams
    (disambiguation) lists 22 entities (articles).
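
Redirect and disambiguation articles together act as a lookup table from names to candidate entities. The sketch below assumes they have already been extracted into two dictionaries; the function name `resolve` is hypothetical, and the data covers only the examples named on these slides (three of the 22 John Williams entries).

```python
def resolve(name, articles, redirects, disambiguation):
    """Return the list of entity articles that a name may denote."""
    if name in disambiguation:
        return list(disambiguation[name])   # ambiguous name: several candidates
    if name in redirects:
        return [redirects[name]]            # alternative name: follow the pointer
    if name in articles:
        return [name]                       # the name is itself an article title
    return []


articles = {
    "John Williams (composer)",
    "John Williams (wrestler)",
    "John Williams (VC)",
    "Venus (planet)",
}
redirects = {
    "John Towner Williams": "John Williams (composer)",
    "Ian Rotten": "John Williams (wrestler)",
}
disambiguation = {
    "John Williams": [
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    ],
}

print(resolve("John Towner Williams", articles, redirects, disambiguation))
print(len(resolve("John Williams", articles, redirects, disambiguation)))
```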

67
Conclusion
  • Overview of Wikipedia
  • Knowledge organization in Wikipedia
  • Accuracy of Wikipedia
  • Application of Wikipedia in NLP