1
Text Annotation Techniques
Presented By
Sreeram Sreenivasan
2
What is an annotated text?
  • Annotated text (e.g.)
  • <html>
  • <title> Sample Document
  • </title>
  • <body> This is an annotated text document.
  • </body>
  • </html>
  • Ordinary text (e.g.)
  • This is an ordinary text document.
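
The difference between the two forms can be sketched in a few lines of Python. This is only an illustration, assuming the standard-library html.parser module; the names TextExtractor and strip_markup are made up for the example. It drops every tag from an annotated text and keeps the ordinary text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.chunks.append(data)


def strip_markup(annotated):
    """Return the ordinary text contained in an annotated (tagged) text."""
    parser = TextExtractor()
    parser.feed(annotated)
    parser.close()
    # Join the pieces and normalise the whitespace between them.
    return " ".join(" ".join(parser.chunks).split())


annotated = (
    "<html><title>Sample Document</title>"
    "<body>This is an annotated text document.</body></html>"
)
print(strip_markup(annotated))
```

The annotation carries structure (which part is the title, which part is the body); once it is stripped, only the words remain.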

3
Key Words
  • DTD
  • SGML

4
Document Type Definition
  • It is a specification that accompanies an
    annotated document.
  • It enables the parser to identify which codes
    (or markup) separate paragraphs, identify topic
    headings, and so on.
  • It also tells the parser how each tag is to be
    processed.
  • The DTD for a document is generally placed at
    the top of the document.

5
Standard Generalized Markup Language
  • SGML is the acronym for Standard Generalized
    Markup Language.
  • It is a standard for how to specify a document
    markup language or tag set.
  • SGML is not itself a document language; each
    language defined with it is described by a DTD.


(Diagram relating SGML, HTML, XML and WML)
6
  • HTML (Hyper Text Markup Language)
  • XML (eXtensible Markup Language)
  • WML (Wireless Markup Language)
  • TEI (Text Encoding Initiative)
  • References

7
HYPER TEXT MARKUP LANGUAGE
  • It is the set of markup symbols or codes inserted
    in a file intended for display on a World Wide
    Web browser page.
  • The markup tells the Web browser how to display a
    Web page's text and images for the user.
  • Each individual markup code is referred to as an
    element (though it is often loosely called a
    tag).
  • Some elements come in pairs that indicate when
    some display effect is to begin and when it is to
    end.

8
  • The basic annotations on an HTML page
  • Document tags: HTML, HEAD, TITLE, BODY, comment
    tags
  • Basic text structures: headings, paragraph, line
    break, blockquote
  • Anchors: HREF, NAME
  • Images: IMG, ALIGN, ALT

9
(No Transcript)
10
Heading Tag
  • Heading 1
  • Heading 2
  • Heading 3
  • Heading 4
  • <H1>Heading 1</H1> <H2>Heading 2</H2>
    <H3>Heading 3</H3> <H4>Heading 4</H4>

11
Paragraph Tag
  • <P>This sort of paragraph usually deserves to
    be broken up into several paragraphs, since its
    sheer bulk dissuades the reader from attempting
    to plumb its depths. </P> <P> On the other hand,
    they can be pretty short. </P> <P> Really short.
    </P>
  • This sort of paragraph usually deserves to be
    broken up into several paragraphs, since its
    sheer bulk dissuades the reader from attempting
    to plumb its depths.
  • On the other hand, they can be pretty short.
  • Really short.

12
HTML specifics
  • Although there are special editors for writing
    HTML files, a basic editor such as MS-Word or
    emacs on Unix is enough.
  • HTML tags are not case-sensitive, i.e. the tags
    <title> and <TITLE> mean the same.
  • Because HTML is standardized, HTML files can be
    viewed with browsers (IE or Netscape), parsers
    or SGML compilers.
  • Sample Document

13
  • <html>
  • <title> Sample Document
  • </title>
  • <body>
  • <p> This is a sample HTML document.</p>
  • <p>It illustrates the usage of tags with the
    actual text.</p>
  • </body>
  • </html>
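
The point that HTML tags are not case-sensitive can be checked with a small Python sketch. It assumes the standard-library html.parser module, which reports tag names in lower case; TagLogger and start_tags are names invented for this illustration, and the two test strings are shortened versions of the sample document.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already converted to lower case.
        self.tags.append(tag)


def start_tags(document):
    """Return the start-tag names of an HTML document, in order."""
    logger = TagLogger()
    logger.feed(document)
    logger.close()
    return logger.tags


lower = "<html><title>Sample Document</title><body><p>Text.</p></body></html>"
upper = "<HTML><TITLE>Sample Document</TITLE><BODY><P>Text.</P></BODY></HTML>"

print(start_tags(lower))
print(start_tags(upper) == start_tags(lower))
```

Both spellings produce the same list of tags, which is what "not case-sensitive" means in practice for a parser.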


14
EXTENSIBLE MARKUP LANGUAGE
  • Definition
  • It is a flexible way to create common
    information formats and share both the format and
    the data on the World Wide Web, intranets, and
    elsewhere.

15
Differences with HTML
  • Tag semantics are flexible: the author defines
    the tags, e.g. <P> in XML can mean a paragraph
    or a phone number.
  • Processing of XML documents depends on the
    receiving application.
  • Supports links to multiple documents.
  • XML contains tags that describe the data, e.g.
    <phoneno> may describe a telephone number. Tags
    may also include attributes, as in HTML.
  • A forgotten tag makes an XML file unusable
    (a parser rejects it as not well-formed), unlike
    HTML, where a browser may simply bypass it.
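
The last point can be seen with a short Python sketch, assuming the standard-library xml.etree.ElementTree module. The helper is_well_formed, the <entry> element and the phone number are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


complete = "<entry><phoneno>555-0100</phoneno></entry>"
forgotten = "<entry><phoneno>555-0100</entry>"  # </phoneno> is missing

print(is_well_formed(complete))
print(is_well_formed(forgotten))
```

An XML parser stops at the first mismatched tag instead of guessing what the author meant, which is why a single forgotten end tag makes the whole file unusable.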

16
Relation with SGML
  • It is basically a subset of SGML (Standard
    Generalized Markup Language).
  • SGML is a standard for specifying a document
    markup language or tag set.
  • Like SGML, XML is based on the principle that
    documents have elements that can be described
    without reference to how the data should be
    displayed (i.e. XML files are created in terms
    of document structure, not appearance).

17
Elements of XML Language
  • An element of XML is a start tag, an end tag and
    the data between them.
    E.g.
    <director>Ed Wood</director>
  • Attributes may also be assigned to an element in
    its start tag.
    E.g. <director location="Hollywood">Ed
    Wood</director> (attribute name illustrative)
    (Unlike HTML, XML tags are case-sensitive.)
  • Sample XML Document (Well-formed).
  • Sample XML Document (Valid).
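
How an application sees an element, its data and its attributes can be sketched in Python with the standard-library xml.etree.ElementTree module. The attribute name "location" is an assumption made for this example, not something fixed by XML.

```python
import xml.etree.ElementTree as ET

# The attribute name "location" is made up for this illustration.
element = ET.fromstring('<director location="Hollywood">Ed Wood</director>')

print(element.tag)              # name taken from the start tag
print(element.text)             # data between the start tag and the end tag
print(element.get("location"))  # value of the attribute

# XML names are case-sensitive, so <Director> cannot be closed by </director>.
try:
    ET.fromstring("<Director>Ed Wood</director>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print(mismatch_accepted)
```

The second parse fails because, to an XML parser, Director and director are two different element names.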

18
  • <?xml version="1.0"?>
  • <doc>
  • <burns>Say<quote>goodnight</quote>,
  • Gracie.</burns>
  • <allen><quote>Goodnight,
  • Gracie.</quote></allen>
  • <applause/>
  • </doc>

19
  • <?xml version="1.0"?>
  • <!DOCTYPE PARENT [
  • <!ELEMENT PARENT (CHILD)>
  • <!ELEMENT CHILD (MARK?,NAME)>
  • <!ELEMENT MARK EMPTY>
  • <!ELEMENT NAME (LASTNAME,FIRSTNAME)>
  • <!ELEMENT LASTNAME (#PCDATA)>
  • <!ELEMENT FIRSTNAME (#PCDATA)>
  • <!ATTLIST MARK
  • NUMBER ID #REQUIRED
  • LISTED CDATA #FIXED "yes"
  • TYPE (natural|adopted) "natural">
  • <!ENTITY STATEMENT "This is well-formed
    XML">
  • ]>

20
  • <PARENT>
  • &STATEMENT;
  • <CHILD>
  • <MARK NUMBER="1" LISTED="yes" TYPE="natural"/>
  • <NAME>
  • <LASTNAME>child</LASTNAME>
  • <FIRSTNAME>second</FIRSTNAME>
  • </NAME>
  • </CHILD>
  • </PARENT>
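
A document that carries its own DTD can be read as in the following Python sketch. It assumes the standard-library xml.etree.ElementTree module, which reads the internal DTD and expands the entity it declares but does not validate the document against the element declarations. The DTD is shortened here to keep the example small.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document: the DTD sits at the top,
# between "<!DOCTYPE PARENT [" and "]>".
document = """<?xml version="1.0"?>
<!DOCTYPE PARENT [
<!ELEMENT LASTNAME (#PCDATA)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ENTITY STATEMENT "This is well-formed XML">
]>
<PARENT>
&STATEMENT;
<CHILD>
<NAME>
<LASTNAME>child</LASTNAME>
<FIRSTNAME>second</FIRSTNAME>
</NAME>
</CHILD>
</PARENT>"""

root = ET.fromstring(document)

# The parser replaces &STATEMENT; with the text declared in the DTD.
statement = root.text.strip()
last = root.findtext("CHILD/NAME/LASTNAME")
first = root.findtext("CHILD/NAME/FIRSTNAME")

print(statement)
print(first, last)
```

Checking the document against the content models (for example that NAME contains LASTNAME followed by FIRSTNAME) needs a validating parser, which this sketch does not use.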

21
Efficiency of XML in Information Retrieval
  • Meaningful markup
  • A single approach accommodates both document and
    data structures and integrates them within
    documents.
  • Enables transfer of data between applications
  • Structural similarity to HTML simplifies
    implementation using traditional web servers,
    browser applications, CGI and Java.

22
  • Files can be processed purely as data, so they
    can be stored or displayed.
  • Files are verbose text, which allows easy
    debugging.
  • It is license-free, platform-independent and
    well supported.

23
(No Transcript)
24
WIRELESS MARKUP LANGUAGE
  • It is an annotation technique that allows the
    text portions of Web pages to be presented on
    cellular telephones and personal digital
    assistants (PDAs) via wireless access.
  • WML is part of the Wireless Application Protocol
    (WAP) that is being proposed by several vendors
    to standards bodies.
  • It was formerly called HDML (Handheld Devices
    Markup Language).

25
  • Just like HTML and XML, WML is read and
    interpreted by a browser built into the WAP
    device.
  • For WAP devices, the browser is commonly called a
    micro browser; its capabilities are inherently
    limited compared with a web browser.
  • Although HTML could be used, WML is preferred
    because it requires less bandwidth.
  • WML also takes less power to process than HTML.

26
TEXT ENCODING INITIATIVE
  • Definition
  • TEI is an international project to develop
    guidelines for the preparation and interchange of
    electronic texts for scholarly research.

27
Need for a common encoding scheme
  • Until the TEI project was undertaken, there was
    no common encoding format for scholarly
    machine-readable texts.
  • None of the existing encoding schemes had been
    able to gain acceptance as a standard.

28
Origin of TEI and factors contributing to it
  • TEI arose out of a planning conference convened
    by ACH at Vassar College, Poughkeepsie, New York,
    in November 1987.
  • Factor I: More is known now about the problems
    of text encoding than at the time of previous
    attempts.
  • Factor II: The recently developed Standard
    Generalized Markup Language (SGML) seemed to be
    the ideal text-encoding scheme.

29
Objectives of TEI
  • A. To specify a common interchange format for
    machine-readable texts
  • B. To provide a set of recommendations for
    encoding new textual materials.
  • C. To document the major existing encoding
    schemes

30
Why TEI chose SGML?
  • It is easier to borrow syntax from an existing
    scheme.
  • The syntax must be relatively simple and must
    allow for user-defined extensions to the
    pre-defined set of tags.
  • SGML was soon shown to meet all the requirements
    of the TEI.
  • SGML also permits the use of multiple tag sets in
    the same text.

31
References
  • HTML: http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html
  • XML: http://www.w3.org/XML/
  • WML: http://www.allnetdevices.com/faq/
  • TEI: http://www.uic.edu/orgs/tei/