New Challenges in Web Data Management - PowerPoint PPT Presentation

Provided by: aml64

1
New Challenges in Web Data Management
  • Amélie Marian
  • Rutgers University

2
About me
  • Undergrad and Masters in France (1998 and 1999)
  • Spent 18 months at INRIA
  • Ph.D. at Columbia University (2005)
    • Evaluating Top-k Queries over Structured and
      Semi-structured Data
  • Assistant Professor at Rutgers (since 2005)

3
My Research Projects
  • WebCob: Web Data Corroboration
    • Deal with conflicting information
    • Good information is reported by several
      trustworthy sources
  • Semi-structured Data Scoring
    • Multidimensional file system search
    • Top-k query processing
  • URSA: Understanding Users' Reviewing Patterns
    • Add structural information to user reviews
      (categories, sentiment)
    • Use both structure and content to analyze,
      cluster, and search reviews

4
Databases
  • Evolved from file systems (1960s)
  • Relational DB systems (1970s)
  • Data organized in tables and relations that model
    the real world
  • Storage structure transparent to user
  • High-level query language
  • Widely used today
  • Many Challenges
  • Query Processing
  • Query Optimization
  • Transaction Processing
  • Schema Refinement
  • View Materialization

5
Information is everywhere
Music
Travel
Banking
Newspapers
Radio
Advertising
Video/TV
Email
Telephony
Manufacturing
Stock Market
Retail / POS
6
New Challenges
  • Data has changed
  • Not as rigidly structured as in the relational
    world
  • Comes from several sources
  • Complex data types (e.g., multimedia)
  • A lot of information is stored as text
  • Many different data providers

Web data is a perfect example of these new
challenges
7
Web 2.0
  • Data is not only generated by professionals but
    by everyone
    • Published content from traditional sources:
      3-4 Gb/day
    • Professional web content: 2 Gb/day
    • User-generated content: 8-10 Gb/day
    • Private text content: 3 Tb/day (200x more)
  • What can we do with this data?
    • User data is often free text: hard to query and
      analyze
    • Source of data is often unknown/untrustworthy
    • Quality of the data is not good

Figures from Raghu Ramakrishnan, Y!
8
Database Management Research
Web
  • How to improve search and surfing
  • Personalization
  • Social Data
  • Data Extraction
  • Collaborative Filtering
  • Data Quality
  • Large-scale Data Management
  • Scalability

9
Personalization
  • User preferences
  • User profile
  • History
  • Social Network
  • Geographical localization
  • Use of user profile, IP, GPS (e.g., Google
    Local)

10
Social Data
  • Leverage community interactions to create and
    refine content
  • Enhance and amplify user interactions
  • Include sources of information
  • Experts, friends, sub-communities of shared
    interest

11
Data Extraction
  • Identify query answers/relevant information
  • Question answering/entity matching
  • Need to understand the structure of the data
  • Not as simple as it sounds
  • e.g., Where can I find good Italian food in
    Wilmington?
  • Use of Semantic Web
  • Domain-specific knowledge

12
Example: Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying

Select Name From PEOPLE Where Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Result:
Bill Gates
Bill Veghte
(from William Cohen's IE tutorial, 2003)
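
Once the extraction step has produced the PEOPLE rows, the query on this slide can be run as ordinary SQL. The following is a minimal Python sketch, not part of the original talk: the in-memory SQLite database and the `run_query` helper are illustrative assumptions, and the truncated organization name is kept exactly as it appears on the slide.

```python
import sqlite3

# Rows as they appear in the slide's PEOPLE table (assumed already extracted).
PEOPLE_ROWS = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),  # truncated on the slide
]


def run_query(organization):
    """Return the names of people in `organization`, in insertion order."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE people (name TEXT, title TEXT, organization TEXT)"
        )
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", PEOPLE_ROWS)
        cursor = conn.execute(
            "SELECT name FROM people WHERE organization = ? ORDER BY rowid",
            (organization,),
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


print(run_query("Microsoft"))
```

The hard part the slide points at is filling the table from free text; the query itself is trivial once structure exists.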
13
Collaborative Filtering
  • The Wisdom of Crowds
  • Not a new idea (see democracy)
  • This is already used in web search (PageRank)
  • Lots of user information that can be used:
    reviews, feedback, tags, votes, etc.
  • Collaborative Editing
  • E.g., Wikipedia
  • Harnessing the collaborative power
  • E.g., Amazon Mechanical Turk
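
One common way to use reviews and votes from a crowd is user-based collaborative filtering: predict a user's rating for an item from the ratings of similar users. The sketch below is an illustration under stated assumptions, not a method described in the talk; the `RATINGS` data, the user and item names, and the choice of cosine similarity are all hypothetical.

```python
import math

# Hypothetical ratings (user -> item -> score on a 1-5 scale), illustration only.
RATINGS = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "d": 4},
    "cat": {"a": 1, "c": 5, "d": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings=RATINGS):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no similar user has rated the item.
    """
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        weighted += sim * their[item]
        total += sim
    return weighted / total if total > 0 else None


print(predict("ann", "d"))
```

Here "ann" agrees with "bob" far more than with "cat", so the prediction for item "d" lands closer to bob's rating than to cat's.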

14
(No Transcript)
15
Data Quality
  • What is the quality of a web source?
  • Provenance/Lineage
  • Can we trace the source?
  • Confidence
  • How much do we trust the source?
  • Correlation
  • Did we agree with the source before?
  • → Goes back to social networking

16
Example: What is the gas mileage of my Honda
Civic?
  • Query "honda civic 2007 gas mileage" on MSN
    Search
  • Is the top hit, the honda.com site, unbiased?
  • Is the autoweb.com web site trustworthy?
  • Are all these values referring to the correct
    year of the model?

The truth is out there, but sometimes there are
multiple truths
Users may check several web sites to get an answer
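
Checking several sites and weighing them by trust can be sketched as trust-weighted voting. This is a simple stand-in chosen for illustration, not the WebCob approach itself; the source names, reported values, and trust scores below are hypothetical and do not come from the talk or from any real web site.

```python
from collections import defaultdict

# Hypothetical (source, reported value, trust in [0, 1]) triples.
REPORTS = [
    ("source_a", 30, 0.9),
    ("source_b", 30, 0.6),
    ("source_c", 36, 0.7),
    ("source_d", 25, 0.2),
]


def corroborate(reports):
    """Rank candidate values by the total trust of the sources reporting them.

    Returns (value, share) pairs, highest share first, where the shares
    of all candidate values sum to 1.
    """
    scores = defaultdict(float)
    for _source, value, trust in reports:
        scores[value] += trust
    total = sum(scores.values())
    return sorted(
        ((value, score / total) for value, score in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


for value, share in corroborate(REPORTS):
    print(value, round(share, 3))
```

A value reported by several moderately trusted sources can outrank one reported by a single source, which matches the idea that good information is reported by several trustworthy sources. Keeping the full ranking, rather than only the winner, leaves room for "multiple truths".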
17
Evaluation?
  • One of the main problems of current research
  • How do we evaluate the benefit/gain of the
    techniques?
  • Companies have ways to estimate the impact of new
    algorithms
  • Click through
  • Ad revenue
  • Number of visitors
  • User feedback
  • Significantly harder for academic researchers
  • Often hard to identify the optimal answer
  • No access to unlimited data and, more
    importantly, users
  • Some benchmarks available (TREC, INEX)
  • User experiments, Mechanical Turk
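
When relevance judgments are available, as in benchmark collections, a ranked answer list can be scored with standard retrieval measures. The sketch below shows precision at k and average precision as an example of such measures; the talk does not say which metrics it has in mind, and the document identifiers are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i where a relevant result appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


# Hypothetical ranking and relevance judgments.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

Both measures need a set of answers judged relevant in advance, which is exactly what is hard to obtain when the optimal answer is unclear.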

18
Database Research?
  • This is closely linked to work done in other
    communities
  • Information Retrieval
  • Machine Learning
  • Artificial Intelligence
  • Natural Language Processing
  • There is some movement toward these communities
    working together for a common goal
  • Still a long way to go
  • Different philosophies
  • Exciting collaboration opportunities!

19
Database Research
  • These are only a few of the database/data
    management challenges
  • Both on web data and on regular data
  • Some hot and exciting topics
  • Probabilistic Databases
  • Data Integration
  • Sensor Networks
  • Top-k/Ranked/Skyline Queries
  • Privacy
  • Streams

20
Thank you!