1
Language-Model Based Text-Compression
  • James Connor
  • Antoine El Daher

2
Compressing with Structure
  • Compression
  • Huffman
  • Arithmetic
  • Lempel-Ziv (LZ77, LZ78)
  • Most popular compression tools are based on
    LZ77
  • Exploiting structure
  • Our goal: incorporate prior knowledge about
    the structure of the input sequence

3
Perplexity and Entropy
  • The expected number of bits per symbol is
    bounded below by the entropy of the sequence
    to be compressed
  • A low-perplexity language model is also a
    low-entropy distribution, so it supports
    shorter codes (see the note below)
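
To make the bound explicit (our notation, not on the slides): if the source distribution is p and we code with a model q, the expected code length per symbol is the cross-entropy, which is minimized exactly when q = p:

    H(p, q) = -\sum_x p(x) \log_2 q(x)
            = H(p) + D_{\mathrm{KL}}(p \,\|\, q)
            \ge H(p),
    \qquad \mathrm{PP}(q) = 2^{H(p, q)}

So lowering the model's perplexity PP(q) directly lowers the expected bits per symbol achievable when coding with that model.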

4
Character N-grams
  • Represent text as an nth-order Markov chain of
    characters
  • Maintain counts of n-grams
  • Build a library of Huffman tables based on
    these counts (see the sketch below)
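
A minimal sketch of these two steps in Python (our code, not the authors'; the function names and the order n=3 are our choices):

```python
import heapq
from collections import defaultdict, Counter

def build_huffman_codes(weights):
    """Standard Huffman construction: dict symbol -> weight in,
    dict symbol -> bitstring out."""
    heap = [(w, i, sym) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol table
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:                    # merge the two lightest subtrees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix):                 # read the codes off the tree
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def train_char_tables(text, n=3):
    """Count character n-grams and build one Huffman table
    per (n-1)-character context."""
    context_counts = defaultdict(Counter)
    for i in range(n - 1, len(text)):
        context_counts[text[i - n + 1:i]][text[i]] += 1
    return {ctx: build_huffman_codes(c) for ctx, c in context_counts.items()}
```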

5
Compressing the File
  • Training
  • For each bigram in the training set, we keep a
    map of all the words that can follow it, along
    with their probabilities.
  • E.g. "to have" → (seen, 0.1), (been, 0.1),
    (UNK, 0.1), etc.
  • Then for each bigram, we build a Huffman tree
    (see the sketch below).
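
A sketch of the training step, reusing build_huffman_codes from above (the UNK pseudo-count is our assumption, loosely matching the 0.1 on the slide):

```python
from collections import defaultdict, Counter

def train_word_tables(tokens, unk_weight=0.1):
    """Map each bigram context to a Huffman table over the words
    observed to follow it, reserving some weight for <UNK>."""
    followers = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)][w3] += 1
    tables = {}
    for bigram, counts in followers.items():
        weights = dict(counts)
        # Assumption: <UNK> gets a pseudo-count proportional to the
        # total count of this context.
        weights["<UNK>"] = max(1, round(unk_weight * sum(counts.values())))
        tables[bigram] = build_huffman_codes(weights)
    return tables
```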

6
Compressing the File
  • Compressing
  • We go through the input file, using the Huffman
    trees from the training set to code each word
    based on the two preceding words.
  • If the trigram is unknown, we code the UNK
    token, then revert to a unigram model (also
    coded using Huffman).
  • If the unigram is unknown, we use a
    character-level Huffman code (trained on the
    training set) to code it.
  • Decompression works similarly: the decoder
    mimics the same behavior (see the sketch below)
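
A sketch of the encoding loop under the same assumptions (the <S> padding and the <EOW> end-of-word marker are our inventions; the decoder can mirror every branch because it has already decoded the two preceding words):

```python
def encode(tokens, word_tables, unigram_codes, char_codes):
    """Code each word from its bigram context, falling back
    trigram -> unigram -> character level."""
    bits = []
    for i, word in enumerate(tokens):
        context = (tokens[i - 2] if i >= 2 else "<S>",
                   tokens[i - 1] if i >= 1 else "<S>")
        table = word_tables.get(context)
        if table is not None:
            if word in table:
                bits.append(table[word])      # trigram hit
                continue
            bits.append(table["<UNK>"])       # tell the decoder to fall back
        # If the context itself was never seen in training, the decoder
        # knows that too, so we drop straight to the unigram model.
        if word in unigram_codes:
            bits.append(unigram_codes[word])
        else:
            bits.append(unigram_codes["<UNK>"])
            for ch in word:                   # spell the word out
                bits.append(char_codes[ch])
            bits.append(char_codes["<EOW>"])  # assumed end-of-word marker
    return "".join(bits)
```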

7
Extensions
  • We maintain a sliding context window while
    compressing a file: a word's counts are
    incremented when it enters the window and
    decremented when it leaves.
  • This lets the trigram/bigram models exploit
    the local context and assign more
    representative weights (see the sketch below).
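
A minimal sketch of the sliding window (the class name and default size are ours; the affected Huffman tables would have to be rebuilt, or an adaptive code used, whenever counts change, with the decoder replaying the same updates to stay in sync):

```python
from collections import deque, Counter

class SlidingCounts:
    """Adaptive word counts over the last `size` tokens:
    increment a word's count on entry, decrement on exit."""
    def __init__(self, size=1000):
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def push(self, word):
        self.window.append(word)
        self.counts[word] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
```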

8
Results
  • Compression ratio competitive with gzip