Title: Collaborative Publishing: Wiki and Wikipedia
1. Collaborative Publishing: Wiki and Wikipedia
2. Agenda
- Overview of Wiki and Wikipedia
- Concepts: wiki, Wikipedia, Wikimedia Foundation
- Software and Hardware of Wikis
- Size of Wikipedia
- Knowledge Organization of Wikipedia
- Improving Wikipedia's Accuracy
- Wikipedia in Natural Language Processing
3. Overview of Wiki and Wikipedia
- Reference
- Keshava P Subramanya (keshava_at_cs.ucsb.edu)
- Roopa Kannan (roopakannan_at_cs.ucsb.edu)
4.
- What is a wiki?
- What is Wikipedia?
5. What is Wikipedia?
- Wikipedia is a freely licensed encyclopedia written by thousands of volunteers in many languages
- Free license allows others to freely copy, redistribute, and modify our work commercially or non-commercially
- Founded January 15, 2001
- wikipedia.org
6. What is a wiki?
- A wiki is software that allows users to create, edit, and link web pages easily.
- Wikis are often used to create collaborative websites and to power community websites.
- Ward Cunningham, developer of the first wiki, WikiWikiWeb, originally described it as "the simplest online database that could possibly work".
- wikipedia.org
7. What is the Wikimedia Foundation?
- Non-profit foundation
- Aims to distribute a free encyclopedia to every single person on the planet in their own language
- Wikipedia and its sister projects
- Funded by public donations
- http://wikimediafoundation.org/wiki/Donate
- Applying for grants
- wikimediafoundation.org
8. Wikimedia Projects
- Wikipedia
- Wiktionary
- Wikibooks
- Wikisource
- Wikiquote
- Wikispecies
- Wikimedia Commons
- Wikinews
- .......
9. Wikimedia Foundation
- Governed by a Board of Directors (5 positions: 1 permanent (Jimmy Wales), 2 Bomis reps, 2 community reps)
- Foundation coordinates official (volunteer) positions: fundraising, legal, technical development, press, etc.
- MediaWiki (software)
- And the projects: Wiktionary, Wikinews, Wikipedia, Wikiversity, Wikiquote, Wikisource, Commons
- Local chapters: English (en), German (de), Italian (it), etc.; 215 languages in total
10. Advantages of Free License
- Remains non-proprietary
- Decreases individual sense of ownership
- Increases a sense of shared ownership
- Enhances the popularity of Wikipedia
- Attribution requirement extends brand
11. Free Software
- MediaWiki is licensed under the GNU General Public License (GPL)
- The website runs entirely on free software
- GNU/Linux
- Apache
- MySQL
- PHP
12. Wikimedia's Hardware
- 40 servers
- Squid caching servers in front to serve cached objects quickly
- Apache/PHP webservers in the middle
- Database backend (MySQL)
14. How big is Wikipedia?
- English Wikipedia is the largest and has over 130 million words
- English Wikipedia is larger than Britannica and Microsoft Encarta combined
- In 15 months the publicly distributed compressed database dumps may reach 1 terabyte total size
- http://en.wikipedia.org/wiki/Wikipedia:Statistics
15. How big is Wikipedia Globally?
- English: 533,000 articles
- German: 220,000 articles
- Japanese: 110,000 articles
- French: 100,000 articles
- Swedish: 71,000 articles
- Nearly 1.5 million across 200 languages
- 20 with >10,000; 50 with >1,000
- http://en.wikipedia.org/wiki/Special:SiteMatrix
- http://meta.wikimedia.org/wiki/Statistics
- http://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics
16. How popular is Wikipedia?
17. Knowledge Organization with Wikipedia
- Reference
- Jakob Voss, Common Library Network (GBV): Knowledge Organization with Wikipedia. 5th NKOS Workshop, Alicante, September 21, 2006
- Phoebe Ayers, UC Davis, Physical Sciences & Engineering Library, phoebe.ayers _at_ gmail.com, en.wikipedia.org/wiki/User:Phoebe Ayers
18. Scenario 1
- Question
- I know there was an American President named Bush
- Task
- Exact name?
- Other information
- How to search on Wikipedia?
- How to search on Google?
19. Scenario 2
- Title: Pacific navigators Australia explorers
- Entities
- James Cook (15630)
- La Pérouse (58090)
- Tasman (1988)
- Categories
- explorers (5208)
- Description
- Find the navigators and explorers in the Pacific sea in search of Australia
- Narrative
- I am doing an essay on the explorers who discovered or charted Australia. I am already aware of TASMAN, COOK and La Pérouse and would like to get the full list of navigators who contributed to the discovery of Australia. Those for who there are disputes about their actual discovery of (parts of) Australia are still acceptable. I am mainly interested by the captains of the ships but other people who were on board with those navigators still relevant (naturalists or others). I am not interested in those who came later to settle in Australia.
- Topic from INEX 2007
20. Scenario 3
- School of Information Science
- SIS?
- Faculty?
- Ronald L. Larsen
- Peter Brusilovsky
- ...
- How to communicate?
- http://en.wikipedia.org/wiki/Wikipedia:About
21. What are Wikipedia namespaces?
- Main: The main namespace or article namespace is the encyclopedia proper. It is the default namespace and does not use a prefix.
- Portal (prefix Portal:) is for reader-oriented portals that help readers find and browse through articles related to a specific subject.
- User (prefix User:) is a namespace that provides pages for Wikipedia users' personal presentations and auxiliary pages for personal use, for example containing bookmarks to favorite pages.
- Image (prefix Image:, also called image description pages) is a namespace that provides info about images and sound clips, one page for each, with a link to the image or sound clip itself.
22. Wikipedia Namespace (cont.)
- Category: contains categories of pages, with each displaying a list of pages in that category and optional additional text.
- Help: documents the basic, technical features of Wikipedia.
- Talk: talk pages are used to discuss changes to the corresponding page in the associated namespace. Pages in the user talk namespace are used to leave messages for a particular user.
- The talk namespace associated with the main article namespace has the prefix Talk:,
- while the talk namespace associated with the user namespace has the prefix User talk:
23. Wikipedia Namespace (cont.)
- MediaWiki (prefix MediaWiki:) is a namespace containing interface texts such as link labels and messages. They are used for adjusting the localisation (i.e. local version) of interface messages without waiting for a new LanguageXx.php file to get installed.
- Template (formerly part of the MediaWiki namespace) is used to define a standard text which can then be conveniently added within pages, either the text itself at the time of adding, or a reference to the text at the time of viewing the page. The latter way effectively changes all such occurrences of the standard text automatically by just editing the page where the text is defined.
- http://en.wikipedia.org/wiki/Wikipedia:Namespace
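As a rough illustration of the namespace prefixes listed on the last three slides, the sketch below splits a full page title into namespace and page name. The function name `split_title` and the assumption that a prefix is separated from the page name by a colon are illustrative choices, not something the slides specify.

```python
# Namespace prefixes named on the slides. Treating the first colon as the
# separator is an assumption about how page titles are written.
KNOWN_NAMESPACES = {
    "Portal", "User", "Image", "Category", "Help",
    "Talk", "User talk", "MediaWiki", "Template",
}


def split_title(full_title: str) -> tuple[str, str]:
    """Return (namespace, page_name); the main namespace has no prefix."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return prefix, rest
    # No recognised prefix: the page belongs to the main (article) namespace.
    return "Main", full_title


if __name__ == "__main__":
    for title in ["Venus (planet)", "User talk:Example", "Template:Infobox"]:
        print(title, "->", split_title(title))
```

Titles whose text merely contains a colon stay in the main namespace, because only the listed prefixes are recognised.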
24. How do articles get written?
- Someone starts it
- Someone else checks it
- A (possibly third) party edits it
- http://en.wikipedia.org/wiki/Help:Contents/Editing_Wikipedia
25. Article Criteria
- Notable (encyclopedic)
- Not vanity
- Not duplication
- Community consensus
27. http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page
30. Edit wars and other things that go boom
32. Predictable vandalism: posted and reverted the same minute (10:31)
34. What is Collaborative Publishing?
- Collaborative works are created by multiple people together rather than individually
- Publishing knowledge
- Some projects are overseen by an editor or editorial team
- Many grow without any top-down oversight
35. Characteristics 1: Access control
- All users can edit any page, subject to access control
- Access control
- http://en.wikipedia.org/wiki/Wikipedia:Editorial_oversight_and_control#Types_of_access
36. Characteristics 2: Revision control
- http://en.wikipedia.org/wiki/Wikipedia:Editorial_oversight_and_control#Types_of_access
37. Wikipedia's Accuracy
38. Criticisms
- Could a collaborative project that anyone can edit be a public good?
- Contribute articles
- Quality of articles is close to Encyclopaedia Britannica
- Vandalism
- Creeping bureaucracy; growing instances of infighting among editors
- The community's anti-intellectual attitude
- digital Maoism
- faith-based encyclopaedia
- http://en.wikipedia.org/wiki/Wikipedia:About
39. Further criticisms
- Entries for pop cultural figures vs. those for great literary figures, scientists, etc.
- Entry for Britney Spears longer than entry for St. Augustine
- Seinfeld longer than Shakespeare; Barbie longer than Bellow
- Response: nothing to get exercised about
40. 80/10 Rule
- Counting only logged-in users, and even excluding some prominent approved bot users
- 10 percent of all users make 80 percent of all edits
- 5 percent of all users make 66 percent of edits
- Half of all edits are made by just 2 1/2 percent of all users
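The concentration figures on this slide can be reproduced from per-user edit counts. The sketch below is one plausible way to compute them; the function name `share_of_edits` and the toy counts are illustrative assumptions, not data from the talk.

```python
import math


def share_of_edits(edit_counts: list[int], top_fraction: float) -> float:
    """Fraction of all edits made by the most active `top_fraction` of users."""
    total = sum(edit_counts)
    if total == 0:
        return 0.0
    ranked = sorted(edit_counts, reverse=True)
    # Always count at least one user so tiny samples still give an answer.
    top_n = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total


if __name__ == "__main__":
    # Hypothetical edit counts for ten logged-in users (bots excluded).
    counts = [80, 5, 5, 2, 2, 2, 1, 1, 1, 1]
    print(share_of_edits(counts, 0.10))
    print(share_of_edits(counts, 0.50))
```

Filtering out anonymous and bot accounts before calling the function would mirror the counting rule stated on the slide.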
41. Edits by Anons
- Controversial, intriguing
- Yes, you can edit this page
- Without logging in!
42. Edits by Anons
- Anonymous IP numbers can edit Wikipedia, and do
- But these edits make up a total of around 18 percent of all edits, with some evidence of a downward trend over time
- Anecdotally, many regular users report sometimes editing anonymously by accident or as a quiet form of sock puppeting
43. Edits across namespaces
- Articles: 85 percent
- Talk pages: 8 percent
- User pages: 3 percent
- User talk pages: 4 percent
- These percentages were stable in 2003
- And in 2004
44. Studying the Accuracy of Wikipedia
- Study by Nature
- Factual errors, omissions, or misleading statements, Wikipedia vs. Britannica: 162 vs. 123; major errors: 4 vs. 4
- Survey: do they think sample articles are accurate?
- 76 percent: accurate
45. Separate the wheat from the chaff
- Proposal 1: Based on explicit article validation
- A trusted user (defined using various criteria) explicitly marks an article as good
- Peer-based explicit system: allows users to choose which of their peers to trust, thus providing different results for each user
- Shortcoming: requires explicit input from reviewers
46.
- Proposal 2: Automatically assess information quality by calculating metrics based on metadata recorded and stored by Wikipedia
- Metrics: number of edits made to the article and number of unique editors of the article
- Distinguishing two classes of pages
- Link ratio analysis
- Quality of editors
- Trustworthiness or reputation of authors and articles
- Segments instead of articles
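The two metadata metrics named in Proposal 2 (edits per article and unique editors per article) can be computed directly from a revision history. This is a minimal sketch assuming a simplified `Revision` record; the field names and the derived `edits_per_editor` ratio are illustrative additions, not part of the proposal as presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified view of one entry in an article's history."""
    editor: str
    timestamp: str


def article_metrics(history: list[Revision]) -> dict[str, float]:
    """Compute edit count, unique editor count, and their ratio."""
    edits = len(history)
    unique_editors = len({rev.editor for rev in history})
    return {
        "edits": edits,
        "unique_editors": unique_editors,
        # Extra illustrative signal: how concentrated the editing is.
        "edits_per_editor": edits / unique_editors if unique_editors else 0.0,
    }


if __name__ == "__main__":
    history = [
        Revision("alice", "2006-01-01"),
        Revision("bob", "2006-01-02"),
        Revision("alice", "2006-01-03"),
    ]
    print(article_metrics(history))
```

How such numbers map to a quality judgement (thresholds, comparison between page classes) is left open here, as it is on the slide.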
47.
- Surprisingly successful
- Large/Complete/Coverage
- Again, free
48. References
- Cohen, Noam. Courts Turn to Wikipedia, but Selectively. The New York Times, January 29 (2007): Section C, page 3.
- Economist. Battle of Britannica. Economist 378.8471 (April 1, 2006): 65-66.
- Fallis, Don. The Epistemic Costs and Benefits of Collaboration. Southern Journal of Philosophy 44.S (2006): 197-208.
- Fallis, Don. On Verifying the Accuracy of Information: Philosophical Perspectives. Library Trends 52.3 (2004): 463-487.
- Fricke, Martin and Don Fallis. Indicators of Accuracy of Consumer Health Information on the Internet. Journal of the American Medical Informatics Association 9 (2002): 73-79.
- Giles, J. Internet Encyclopedias Go Head to Head. Nature 438.7069 (December 15, 2005): 900-901.
- Hettinger, Edwin. Justifying Intellectual Property. Philosophy and Public Affairs 18 (1989): 31-52.
- Paine, Lynn Sharp. Trade Secrets and the Justification of Intellectual Property: A Comment on Hettinger. Philosophy and Public Affairs 20 (1991): 247-263.
- Poe, Marshall. The Hive. Atlantic Monthly 298.2 (September 2006): 86-94.
- Resnik, David. A Pluralistic Account of Intellectual Property. Journal of Business Ethics 46 (2003): 319-335.
- Schiff, Stacy. Know it All. New Yorker 82.23 (July 31, 2006).
- Sunstein, Cass. Mobbed up. New Republic 230.24 (June 28, 2004): 40-45.
- Surowiecki, James. The Wisdom of Crowds. New York: Anchor Books, 2004.
49. Wikipedia in NLP
- Ontology
- Thesauri
- Categorization
- Topic Detection
- Information Retrieval (Query Expansion)
- Word Sense Disambiguation
- Question Answering
- Translation (CLIR)
50. Wikitology!
- Using Wikipedia as an ontology offers the best of both approaches
- Each article is a concept in the ontology
- Terms linked via Wikipedia's category system and inter-article links
- It's a consensus ontology created, kept current and maintained by a diverse community
- Overall content quality is high
51. Wikitology features
- Terms have unique IDs (URLs) and are self-describing for people
- Several underlying graphs provide structure: categories, article links
- Article history contains useful meta-data (e.g., for trust)
- External sources provide more info (e.g., Google's PageRank)
- Some of the data is available in structured form, e.g., in RDF from DBpedia
52. Semantic Wikipedia
- Völkel et al. (2006). Semantic Wikipedia. WWW2006 conference
- Typed links: [[is capital of::England]] -> RDF triples
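To make the typed-link idea concrete, the sketch below turns typed links in page text into (subject, relation, object) triples. It assumes the link syntax `[[relation::target]]`, with the page title as the subject; both the syntax and the function name `extract_triples` are assumptions for illustration, and the output is a plain tuple, not serialized RDF.

```python
import re

# Assumed typed-link syntax: [[relation::target]] or [[relation::target|label]].
TYPED_LINK = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title: str, wikitext: str) -> list[tuple[str, str, str]]:
    """Return one (subject, relation, object) triple per typed link."""
    return [
        (page_title, match.group(1).strip(), match.group(2).strip())
        for match in TYPED_LINK.finditer(wikitext)
    ]


if __name__ == "__main__":
    text = "London [[is capital of::England]] and lies on the [[Thames]]."
    print(extract_triples("London", text))
```

Ordinary untyped links such as `[[Thames]]` are ignored, since they carry no relation name.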
53. Thesauri
- Reference
- Milne, D., Medelyan, O., and Witten, I. H. (2006). Mining Domain-Specific Thesauri from Wikipedia: A case study. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
- Milne, D., Witten, I. H., Nichols, D. M. (2007). Extracting corpus specific knowledge bases from Wikipedia. CIKM. Lisbon, Portugal.
54. Thesauri
- Thesauri
- An indexed compilation of words with similar, related, broader, narrower and opposite meanings.
- Wikipedia
- Each article: a concept
- Hyperlinks: relations
- Equivalence: USE, USE FOR
- Hierarchical: BT, NT
- Associative: RT
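A small sketch of how the three relation types might be assembled into thesaurus entries. It assumes equivalence comes from redirects, hierarchy from category membership, and association from ordinary article links; that mapping, the function name `build_thesaurus`, and the sample data are illustrative assumptions rather than the cited authors' exact procedure.

```python
from collections import defaultdict


def build_thesaurus(redirects, parents, links):
    """Build USE and UF/BT/NT/RT relations from Wikipedia-like structures.

    redirects: alternative name -> preferred article title (USE / USE FOR)
    parents:   article title -> broader categories (BT / NT)
    links:     article title -> linked article titles (RT)
    """
    entries = defaultdict(lambda: {"UF": set(), "BT": set(), "NT": set(), "RT": set()})
    use = dict(redirects)  # non-preferred term -> preferred term
    for alt_name, preferred in redirects.items():
        entries[preferred]["UF"].add(alt_name)
    for term, broader_terms in parents.items():
        for broader in broader_terms:
            entries[term]["BT"].add(broader)
            entries[broader]["NT"].add(term)
    for source, targets in links.items():
        for target in targets:
            # Treat a hyperlink as a symmetric "related term" association.
            entries[source]["RT"].add(target)
            entries[target]["RT"].add(source)
    return use, dict(entries)


if __name__ == "__main__":
    use, entries = build_thesaurus(
        redirects={"Morning Star": "Venus (planet)"},
        parents={"Venus (planet)": ["Planets"]},  # hypothetical category
        links={"Venus (planet)": ["Evening Star"]},
    )
    print(use)
    print(entries["Venus (planet)"])
```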
55. Topic Detection
- Reference
- Identifying document topics using the Wikipedia category network, Peter Schonhofen, Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)
56.
- Topic Detection
- Detect the concepts in the document
- Select the most dominant concepts to represent the document.
- Ontology from Wikipedia
- Coverage of Wikipedia is general purpose and very wide,
- Structure is rich and consistent
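One very simple way to picture the two steps above (detect concepts, then keep the dominant ones) is to let every article title found in the document vote for its categories. The sketch below does exactly that with naive substring matching; it is a toy stand-in, not the cited method, and `dominant_categories` and the sample categories are hypothetical.

```python
from collections import Counter


def dominant_categories(text: str, title_categories: dict[str, list[str]],
                        top_k: int = 3) -> list[str]:
    """Rank categories by how many detected article titles vote for them."""
    lowered = text.lower()
    votes = Counter()
    for title, categories in title_categories.items():
        # Crude concept detection: the article title occurs in the text.
        if title.lower() in lowered:
            votes.update(categories)
    return [category for category, _ in votes.most_common(top_k)]


if __name__ == "__main__":
    title_categories = {  # hypothetical title -> categories mapping
        "James Cook": ["Explorers", "Navigators"],
        "Tasman": ["Explorers"],
        "Wal-Mart": ["Retailers"],
    }
    doc = "James Cook and Tasman charted parts of Australia."
    print(dominant_categories(doc, title_categories, top_k=1))
```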
57. Wikipedia structure
- Components: articles, image pages, discussion about article contents, authors, page component templates and so on.
- Articles: titles, categories, references to other articles
- Categories: organized hierarchically into sub- and super-categories (not just a tree)
- Authors: create the links between articles and the hierarchy of categories.
58. Wikipedia structure
59. Wikipedia for classification
- Reference
- Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. Evgeniy Gabrilovich and Shaul Markovitch, American Association for Artificial Intelligence (AAAI) 2006
- Banerjee, S., Ramanathan, K., Gupta, A. (2007). Clustering short text using Wikipedia. SIGIR
- Meyer, M., Rensing, C. (2007). Categorizing Learning Objects based on Wikipedia as Substitute Corpus. Proceedings of the First International Workshop on Learning Object Discovery and Exchange.
60. Text Classification
- Deals with the automatic assignment of category labels to natural language documents
- Represents documents as bags of words (BOW)
- Features come from words
- Limitation of BOW
- Limited to individual word occurrences in the training set
- "Wal-Mart supply chain goes real time"
- "Wal-Mart manages its stock with RFID technology"
- Effective for categorization of medium difficulty, but poor for small categories or short documents
- Use an encyclopedia to endow the machine with the breadth of knowledge available to humans
61.
- Auxiliary text classifier
- Matches documents with the most relevant articles of Wikipedia
- Conventional bag of words + new features
- Example of the auxiliary text classifier idea
- "Bernanke takes charge"
- BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, ...
- Using Wikipedia
- Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document
- Leverage the knowledge gained from these articles
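The "bag of words + new features" idea can be sketched with plain cosine similarity: score each encyclopedia article against the document, then append the best-matching article titles as extra features. Cosine over raw word counts is a stand-in chosen for brevity, and the article texts, the `CONCEPT:` feature prefix, and the function names are illustrative assumptions, not the cited authors' feature generator.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; punctuation is dropped."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def enrich(document: str, articles: dict[str, str], top_k: int = 2) -> Counter:
    """Return the document's word features plus the top matching article titles."""
    doc_bow = bag_of_words(document)
    scored = sorted(
        ((cosine(doc_bow, bag_of_words(body)), title) for title, body in articles.items()),
        reverse=True,
    )
    features = Counter(doc_bow)
    for score, title in scored[:top_k]:
        if score > 0:  # only add concepts that actually overlap with the text
            features["CONCEPT:" + title] += 1
    return features


if __name__ == "__main__":
    articles = {  # toy article texts written for this example
        "Ben Bernanke": "Ben Bernanke is an economist who chaired the Federal Reserve",
        "RFID": "Radio frequency identification tags track stock in a supply chain",
    }
    print(enrich("Bernanke takes charge", articles))
```

The enriched counter can then be fed to any conventional bag-of-words classifier in place of the word counts alone.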
62. Word Sense Disambiguation
- Some names denote multiple entities
- John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.
- John Williams -> John Williams (composer)
- John Williams lost a Taipei death match against his brother, Axl Rotten.
- John Williams -> John Williams (wrestler)
- John Williams won a Victoria Cross for his actions at the battle of Rorke's Drift.
- John Williams -> John Williams (VC)
63.
- Some entities have multiple names
- John Williams (composer) -> John Williams
- John Williams (composer) -> John Towner Williams
- John Williams (wrestler) -> John Williams
- John Williams (wrestler) -> Ian Rotten
- Venus (planet) -> Venus
- Venus (planet) -> Morning Star
- Venus (planet) -> Evening Star
64. WSD
- Web searches
- Queries about Named Entities (NEs) constitute a significant portion of popular web queries.
- Ideally, search results are clustered such that
- In each cluster, the queried name denotes the same entity.
- Each cluster is enriched by querying the web with alternative names of the corresponding entity.
- Web-based Information Extraction (IE)
- Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).
- Named entity disambiguation is essential for performing a meaningful aggregation.
65. Wikipedia Structures
- In general, there is a many-to-many relationship between names and entities, captured in Wikipedia through
- Redirect articles.
- Disambiguation articles.
- Hyperlinks: an article may contain links to other articles in Wikipedia.
- Categories: each article belongs to at least one Wikipedia category.
66. Redirect Articles
- Redirect article
- Exists for each alternative name used to refer to an entity in Wikipedia.
- Example: the article titled John Towner Williams consists of a pointer to the article John Williams (composer).
- Disambiguation article
- Lists all Wikipedia entities (articles) that may be denoted by an ambiguous name.
- Example: the article titled John Williams (disambiguation) lists 22 entities (articles).
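The many-to-many relation between names and entities can be held in a plain dictionary built from redirect and disambiguation articles. The sketch below shows one possible shape; the function names are hypothetical, and the sample data only uses the examples that appear on these slides.

```python
from collections import defaultdict


def build_name_dictionary(article_titles, redirects, disambiguations):
    """Map each name to the set of entities (article titles) it may denote."""
    names = defaultdict(set)
    for title in article_titles:
        names[title].add(title)  # every article is named by its own title
    for alt_name, target in redirects.items():
        names[alt_name].add(target)  # redirect: alternative name -> entity
    for ambiguous_name, entities in disambiguations.items():
        names[ambiguous_name].update(entities)  # one name, several entities
    return dict(names)


def names_for(entity, name_dictionary):
    """Invert the dictionary: all names known to denote `entity`."""
    return {name for name, entities in name_dictionary.items() if entity in entities}


if __name__ == "__main__":
    dictionary = build_name_dictionary(
        article_titles=["John Williams (composer)", "John Williams (wrestler)",
                        "John Williams (VC)"],
        redirects={"John Towner Williams": "John Williams (composer)",
                   "Ian Rotten": "John Williams (wrestler)"},
        disambiguations={"John Williams": ["John Williams (composer)",
                                           "John Williams (wrestler)",
                                           "John Williams (VC)"]},
    )
    print(sorted(dictionary["John Williams"]))
    print(sorted(names_for("John Williams (composer)", dictionary)))
```

Choosing among the candidate entities for a given mention still needs context, for example the hyperlinks and categories listed on the previous slide.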
67. Conclusion
- Overview of Wikipedia
- Knowledge organization in Wikipedia
- Accuracy of Wikipedia
- Application of Wikipedia in NLP