Title: New Challenges in Web Data Management
1New Challenges in Web Data Management
- Amélie Marian
- Rutgers University
2About me
- Undergrad and Masters in France (1998 and 1999)
- Spent 18 months at INRIA
- Ph.D. at Columbia University (2005)
- Evaluating Top-k Queries over Structured and Semi-structured Data
- Assistant Professor at Rutgers (since 2005)
3My Research Projects
- WebCob: Web Data Corroboration
- Deal with conflicting information
- Good information is reported by several trustworthy sources
- Semi-structured Data Scoring
- Multidimensional file system search
- Top-k query processing
- URSA: Understanding Users' Reviewing Patterns
- Add structural information to user reviews (categories, sentiment)
- Use both structure and content to analyze, cluster, and search reviews
4Databases
- Evolved from file systems (1960s)
- Relational DB systems (1970s)
- Data organized in tables and relations that model the real world
- Storage structure transparent to user
- High-level query language
- Widely used today
- Many Challenges
- Query Processing
- Query Optimization
- Transaction Processing
- Schema Refinement
- View Materialization
5Information is everywhere
Music
Travel
Banking
Newspapers
Radio
Advertising
Video/TV
Email
Telephony
Manufacturing
Stock Market
Retail / POS
6New Challenges
- Data has changed
- Not as rigidly structured as in the relational world
- Comes from several sources
- Complex data types (e.g., multimedia)
- A lot of information is stored as text
- Many different data providers
Web data is a perfect example of these new
challenges
7Web 2.0
- Data is not only generated by professionals but by everyone
- Published content from traditional sources: 3-4 Gb/day
- Professional web content: 2 Gb/day
- User-generated content: 8-10 Gb/day
- Private text content: 3 Tb/day (200x more)
- What can we do with this data?
- User data is often free text: hard to query and analyze
- Source of data is often unknown/untrustworthy
- Quality of the data is not good
Figures from Raghu Ramakrishnan, Y!
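As a rough sanity check on the "200x more" figure, the sketch below compares the private text volume to the sum of the three public categories. It assumes 1 Tb = 1000 Gb and that the ratio is taken relative to that sum; the slide states neither, so treat both as assumptions.

```python
# Daily volumes from the slide, in Gb/day, as (low, high) ranges.
public_gb = {
    "published": (3, 4),
    "professional_web": (2, 2),
    "user_generated": (8, 10),
}

# 3 Tb/day of private text, converted under the assumption 1 Tb = 1000 Gb.
private_gb = 3 * 1000

low_total = sum(low for low, _ in public_gb.values())
high_total = sum(high for _, high in public_gb.values())

# The smallest ratio uses the largest public total, and vice versa.
ratio_min = private_gb / high_total
ratio_max = private_gb / low_total
print(f"private text is roughly {ratio_min:.0f}x to {ratio_max:.0f}x the public total")
```

Under these assumptions the range brackets 200, which is consistent with the order of magnitude quoted on the slide.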
8Database Management Research
Web
- How to improve search and surfing
- Personalization
- Social Data
- Data Extraction
- Collaborative Filtering
- Data Quality
- Large-scale Data Management
- Scalability
9Personalization
- User preferences
- User profile
- History
- Social Network
- Geographical localization
- Use of user profile, IP, GPS (e.g., Google Local)
10Social Data
- Leverage community interactions to create and refine content
- Enhance and amplify user interactions
- Include sources of information
- Experts, friends, sub-communities of shared interest
11Data Extraction
- Identify query answers/relevant information
- Question answering/entity matching
- Need to understand the structure of the data
- Not as simple as it sounds
- e.g., "Where can I find good Italian food in Wilmington?"
- Use of Semantic Web
- Domain-specific knowledge
12Example: Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying

Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
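Once the entities have been extracted into a table, the query on this slide becomes an ordinary relational query. A minimal sketch using Python's built-in sqlite3 module follows; the extraction step itself is not shown, and the rows are simply typed in as they appear on the slide (including the truncated "Free Soft..").

```python
import sqlite3

# Rows as they appear on the slide; an information-extraction system would
# produce these from the text. The third organization is truncated on the
# slide and is kept as-is.
rows = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", rows)

# The query from the slide, with the organization passed as a parameter.
names = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM PEOPLE WHERE Organization = ? ORDER BY rowid",
        ("Microsoft",),
    )
]
conn.close()
print(names)
```

The hard part in practice is filling the table correctly from free text; the query itself is the easy step.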
13Collaborative Filtering
- The Wisdom of Crowds
- Not a new idea (see democracy)
- This is already used in web search (PageRank)
- Lots of user information that can be used: reviews, feedback, tags, votes, etc.
- Collaborative Editing
- E.g., Wikipedia
- Harnessing the collaborative power
- E.g., Amazon Mechanical Turk
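To make the idea concrete, here is a minimal sketch of user-based collaborative filtering: a user's unknown rating is predicted as a similarity-weighted average of other users' ratings. The users, items, and scores are hypothetical, and cosine similarity over co-rated items is only one of several common choices.

```python
from math import sqrt

# Hypothetical ratings (user -> item -> score on a 1-5 scale).
ratings = {
    "ann": {"pizza": 5, "sushi": 1, "tacos": 4},
    "bob": {"pizza": 4, "sushi": 2, "tacos": 5, "ramen": 2},
    "cat": {"pizza": 1, "sushi": 5, "ramen": 5},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[item]
        den += w
    return num / den if den else None

prediction = predict("ann", "ramen")
print(prediction)
```

Here ann's tastes resemble bob's more than cat's, so the predicted rating for ramen lands closer to bob's score than to cat's.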
15Data Quality
- What is the quality of a web source?
- Provenance/Lineage
- Can we trace the source?
- Confidence
- How much do we trust the source?
- Correlation
- Did we agree with the source before?
- → Goes back to social networking
16Example: What is the gas mileage of my Honda Civic?
- Query "honda civic 2007 gas mileage" on MSN Search
- Is the top hit, the honda.com site, unbiased?
- Is the autoweb.com web site trustworthy?
- Are all these values referring to the correct year of the model?
The truth is out there, but sometimes there are
multiple truths
Users may check several web sites to get an answer
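One simple way to automate what such a user does by hand is trust-weighted voting: each candidate answer is scored by the total trust of the sources that report it. The sketch below is an illustration only, not the WebCob algorithm; the source names, mileage values, and trust scores are all invented.

```python
from collections import defaultdict

# Hypothetical (source, reported mpg, trust in [0, 1]) triples. None of these
# values come from real web sites.
reports = [
    ("source-a.example", 36, 0.6),
    ("source-b.example", 34, 0.5),
    ("source-c.example", 34, 0.7),
    ("source-d.example", 29, 0.2),
]

def corroborate(reports):
    """Rank candidate answers by the summed trust of the sources reporting them."""
    score = defaultdict(float)
    for _source, value, trust in reports:
        score[value] += trust
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

ranked = corroborate(reports)
print(ranked)
```

In this toy data the value reported by two moderately trusted sources outranks the value reported by a single source, which matches the intuition that good information is reported by several trustworthy sources.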
17Evaluation?
- One of the main problems of current research
- How do we evaluate the benefit/gain of the techniques?
- Companies have ways to estimate the impact of new algorithms
- Click through
- Ad revenue
- Number of visitors
- User feedback
- Significantly harder for academic researchers
- Often hard to identify the optimal answer
- No access to unlimited data and, more importantly, users
- Some benchmarks available (TREC, INEX)
- User experiments, Mechanical Turk
18Database Research?
- This is closely linked to work done in other communities
- Information Retrieval
- Machine Learning
- Artificial Intelligence
- Natural Language Processing
- There is some movement toward these communities working together for a common goal
- Still a long way to go
- Different philosophies
- Exciting collaboration opportunities!
19Database Research
- These are only a few of the Database/Data Management challenges
- Both on web data and on regular data
- Some hot and exciting topics
- Probabilistic Databases
- Data Integration
- Sensor Networks
- Top-k/Ranked/Skyline Queries
- Privacy
- Streams
20Thank you!