Algorithms and Data Structures - PowerPoint PPT Presentation

About This Presentation
Title:

Algorithms and Data Structures

Description:

Title: Topical clustering of search results. Author: Ugo. Last modified by: Rossano Venturini. Document presentation format: On-screen Show (4:3).

Slides: 11
Provided by: Ugo46

Transcript and Presenter's Notes

Title: Algorithms and Data Structures


1
  • Algorithms and Data Structures
  • for Massive Datasets
  • (Acube Lab)

Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia,
Giovanni Micale, Francesco Piccinno, Giorgio Audrito
2
A3 Lab (acube.di.unipi.it)
  • Algorithms and data structures for massive
    datasets
  • Data Compression
  • Compressed Indexing
  • Web or arbitrary texts
  • Storage and analysis of massive graphs
  • Information Retrieval on news, tweets, ...

Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent
3
Social Networks and Social Data
  • Given an idea, you need the right platform to
    implement it
  • HW and SW (IT Center)
  • Algorithms (our Lab)
  • Graph structure and textual content
  • Nodes: users (1 bil)
  • Explicit edges: friend, follower, retweet, +1,
    ... (10 bil)
  • Implicit edges: similarity, co-occurrence,
    click, ... (100 bil)

4
No SQL
2006
Hadoop
Cassandra
HyperTable
Cosmos
5
Storage and access to Labeled Graphs
  • Compress the graph structure
  • Compress the node and edge labels
  • Guarantee fast access, dynamicity and search
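One common way to compress a graph's structure while keeping neighbor lists quick to decode is to sort each adjacency list, store the gaps between consecutive node ids, and write each gap as a variable-length integer. The sketch below illustrates that general idea only; it is not taken from the slides, and the function names are made up for this example.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_neighbors(neighbors) -> bytes:
    """Gap-encode a list of node ids (sorted first) as varints."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decompress_neighbors(data: bytes) -> list:
    """Inverse of compress_neighbors: rebuild the sorted id list."""
    neighbors = []
    prev = 0
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value          # undo the gap encoding
            neighbors.append(prev)
            value = shift = 0
    return neighbors


adjacency = [1000000, 1000003, 1000010, 1000200]
blob = compress_neighbors(adjacency)
restored = decompress_neighbors(blob)
```

Nearby ids produce small gaps that fit in one byte each, which is why this kind of encoding tends to pay off on graphs whose neighbor ids cluster together.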

6
Data Compression: Theory and Engineering
  • J. ACM 05
  • ACM-SIAM SODA 09-14
  • ACM WSDM 10
  • ESA 11-14
  • Algorithmica 12
  • SIAM J. Computing 13
  • Key issue
  • Minimize space occupancy
  • Maximize decompression speed

Compressor on DBLP   Compressed space (MB)   Decompression time (secs)
Gzip                 191                     11.6
bzip2                121                     49
Snappy               323                     2.1
LZ4                  215                     1.9
Our result           130 to 149              2.9 to 1.9
A new algorithmic concept: multi-objective design
of compressors
  • Two interesting scenarios
  • - Energy-efficiency issues
  • - Cloud computing

Can we fix the space occupancy and minimize the
decompression time? Or vice versa?
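The two questions above can be read as a constrained choice: bound one resource and optimize the other. The sketch below applies that reading to the four off-the-shelf compressors in the DBLP table; it only selects among fixed candidates and is not the lab's compressor, which the slides do not describe. The type and function names are assumptions made for this example.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    space_mb: float
    decompress_secs: float


# Figures copied from the DBLP table on this slide.
CANDIDATES = [
    Candidate("Gzip", 191, 11.6),
    Candidate("bzip2", 121, 49.0),
    Candidate("Snappy", 323, 2.1),
    Candidate("LZ4", 215, 1.9),
]


def fastest_within_space(cands: Sequence[Candidate],
                         max_space_mb: float) -> Optional[Candidate]:
    """Fix the space budget, minimize decompression time."""
    feasible = [c for c in cands if c.space_mb <= max_space_mb]
    return min(feasible, key=lambda c: c.decompress_secs, default=None)


def smallest_within_time(cands: Sequence[Candidate],
                         max_secs: float) -> Optional[Candidate]:
    """Fix the time budget, minimize space occupancy."""
    feasible = [c for c in cands if c.decompress_secs <= max_secs]
    return min(feasible, key=lambda c: c.space_mb, default=None)


by_space = fastest_within_space(CANDIDATES, 200)   # at most 200 MB
by_time = smallest_within_time(CANDIDATES, 3.0)    # at most 3 seconds
```

Both functions return None when no candidate meets the budget, which is the case a tunable compressor is meant to avoid.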
7
Compressed Indexing: Theory and Engineering
  • J. ACM 05
  • ACM SIGIR 07
  • J. ACM 09
  • ACM Trans. Algo. 10
  • ESA 13
  • ACM-SIAM SODA 13
  • and many others
  • Key issue
  • Minimize space occupancy
  • Maximize substring-search throughput

Suffix-array compressible - Bzip searchable
  • Performance over hundreds of MBs on a commodity PC
  • Count(P) takes 5 microsecs/char, taking about
    bzip's space
  • Locate(P) outputs 100K occ/sec, taking 10%
    space
  • This may be 4x faster than IL, within <35% space
    occupancy
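Count(P) and Locate(P) above are the two basic queries of a suffix-array-based full-text index. The sketch below shows them with the textbook backward search over the Burrows-Wheeler transform, using plain uncompressed tables. It is a teaching-sized illustration of the query semantics, not the engineered index the slide measures; the sentinel character and all function names are assumptions for this example.

```python
def build_index(text: str):
    """Build suffix array, C table and occurrence table for text."""
    text += "\0"  # sentinel, assumed absent from text and smallest
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # i == 0 wraps to the sentinel

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # C[ch] = number of characters in text strictly smaller than ch.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[ch][i] = occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, C, occ


def sa_range(index, pattern: str):
    """Half-open suffix-array range of suffixes prefixed by pattern."""
    sa, C, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(index, pattern: str) -> int:
    lo, hi = sa_range(index, pattern)
    return hi - lo


def locate(index, pattern: str) -> list:
    lo, hi = sa_range(index, pattern)
    return sorted(index[0][i] for i in range(lo, hi))


index = build_index("abracadabra")
```

Count does a constant number of table lookups per pattern character, which matches the per-character cost quoted above. Locate additionally reads one suffix-array entry per occurrence, so its cost grows with the number of matches.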

8
Compressed Indexing: Theory and Engineering
No SQL DB
The <key,value> problem
  • Trie: 14x more space than input data
  • Front coding with two-level indexing
  • 110% of input data
  • 4 microsecs/char
  • Our Compressed Permuterm
  • < 25% of input data, i.e. close to bzip2
  • 10 to 60 microsecs/char
  • So, time close to FC but one-fourth of its space
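Front coding (FC in the list above) stores each key of a sorted dictionary as the length of the prefix it shares with the previous key plus the remaining suffix. A minimal sketch of the plain scheme follows, without the two-level indexing mentioned on the slide; the function names are chosen for this example.

```python
import os


def front_encode(sorted_keys):
    """Encode sorted keys as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        # commonprefix compares character by character on plain strings.
        lcp = len(os.path.commonprefix([prev, key]))
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded


def front_decode(encoded):
    """Rebuild the key list by replaying the shared prefixes."""
    keys = []
    prev = ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        keys.append(prev)
    return keys


keys = ["car", "card", "care", "cat", "dog"]
encoded = front_encode(keys)
```

Decoding a key requires replaying all pairs before it. Schemes that need fast random access therefore usually store some keys in full at regular intervals, which is one plausible role for the second index level.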

Under Y!-patenting
9
We know how to manage everything
10
Information Retrieval
  • Diego Maradona won against Mexico

Dictionary: against, Diego, Maradona, Mexico, won
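The dictionary line above treats the sentence as an unordered set of terms. A minimal sketch of that view, assuming whitespace tokenization and case-insensitive ordering (both are assumptions, since the slide does not say how terms are extracted):

```python
def dictionary(text: str) -> list:
    """Return the distinct terms of text, sorted ignoring case."""
    return sorted(set(text.split()), key=str.lower)


terms = dictionary("Diego Maradona won against Mexico")
```

Word order and the link between "Diego" and "Maradona" are lost in this representation.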
11
Topic Annotators
  • Diego Maradona won against Mexico

Detect mentions and annotate them with
entities/topics extracted from a catalog:
Wikipedia!
we serve about 170k requests/day
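As a rough illustration of what "detect mentions and annotate them" means, the sketch below does a greedy longest-match lookup against a tiny hand-made catalog. The catalog entries, the three-word mention limit and the function name are all hypothetical. A real annotator must also choose among several candidate entities for an ambiguous mention, which this sketch does not attempt.

```python
# Hypothetical toy catalog: lowercase mention -> entity title.
CATALOG = {
    "diego maradona": "Diego Maradona",
    "mexico": "Mexico national football team",
}
MAX_MENTION_WORDS = 3  # assumed upper bound on mention length


def annotate(text: str, catalog=CATALOG) -> list:
    """Return (mention, entity) pairs found by greedy longest match."""
    words = text.split()
    annotations = []
    i = 0
    while i < len(words):
        longest = min(MAX_MENTION_WORDS, len(words) - i)
        for n in range(longest, 0, -1):  # try longer mentions first
            mention = " ".join(words[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                i += n
                break
        else:
            i += 1  # no catalog entry starts at this word
    return annotations


result = annotate("Diego Maradona won against Mexico")
```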
12
A new scenario
obama asks iran for RQ-170 sentinel drone back
us president issues Ahmadinejad ultimatum
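One reading of this slide is that the two headlines describe the same event yet share no words, so a term-based comparison sees them as unrelated. The sketch below checks that with a Jaccard overlap on lowercase whitespace tokens; the tokenization and the function name are assumptions for this example.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase terms common to both texts."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


overlap = jaccard(
    "obama asks iran for RQ-170 sentinel drone back",
    "us president issues Ahmadinejad ultimatum",
)
```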
13
The literature
Many commercial software systems: AlchemyAPI, DBpedia
Spotlight, Extractiv, Lupedia, OpenCalais, Saplo,
SemiTags, TextRazor, Wikimeta, Yahoo! Content
Analysis, Zemanta.
Paper at WWW 2013; we serve about 170k
requests/day
14
Paper at ACM WSDM 2012
Paper at IEEE Software 2012
Details on... http://acube.di.unipi.it/tagme
Paper at ECIR 2012