Tei@DHSI 1 — Introduction to Markup, XML, and the TEI

Till Grallert

1 June 2015

Introduction to Markup, XML, and the TEI

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the needs of the 2015 Introduction to TEI at DHSI.

Slides were produced using MultiMarkdown, Pandoc, Slidy JS, and the Snippet jQuery syntax highlighter.

Textual Markup

In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts.

When we talk about text encoding, what do we mean by a text? What is in a text, and which assumptions do we make in reading it?

What is a text?

Is this text …

Damascus, Quarterly Report, Devey to Lowther 1 Oct. 1908

… the same as this text …

Damascus, Quarterly Report, Devey to Lowther 1 Oct. 1908

… the same as this text …

Damascus, Quarterly Report, Devey to Lowther 1 Oct. 1908

… the same as this text?

Damascus, Quarterly Report, Devey to Lowther 1 Oct. 1908

A text is not a document

Where is the text?

TEI’s definition:

Encoding of texts

What is the point of markup?

Styles of markup

Some more definitions

Separation of form and content

Markup as scholarly activity

Compare markup

Example 1:

<hi rend="dropcap">H</hi>&WYN;ÆT WE GARDE <lb/>na in gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas <lb/>ellen fremedon. oft scyld scefing sceaþe <add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl <add>a</add>
<lb/>of<damage>
<desc>blot</desc> </damage>teah ...

Example 2:

<lg>
    <l>Hwæt! we Gar-dena in gear-dagum</l>
    <l>þeod-cyninga þrym gefrunon,</l>
    <l>hu ða æþelingas ellen fremedon,</l>
</lg> 
<lg>
    <l>Oft Scyld Scefing sceaþena þreatum,</l>
    <l>monegum mægþum meodo-setla ofteah;</l>
    <l>egsode Eorle, syððan ærest wearþ</l>
    <l>feasceaft funden...</l>
</lg>

A useful mental exercise

Imagine you are going to mark up several thousand pages of complex material…

Now, imagine your budget has been halved. Repeat the exercise!

Some alphabet soup

abbr expan
SGML Standard Generalized Markup Language
HTML Hypertext Markup Language
W3C World Wide Web Consortium
XML eXtensible Markup Language
DTD Document Type Definition (or Declaration)
CSS Cascading Style Sheets
XPath XML Path Language
XSLT Extensible Stylesheet Language Transformations
XQuery XML Query Language
RELAX NG Regular Language for XML Next Generation
SVG Scalable Vector Graphics (expressed in XML)

… and then there’s also TEI, the Text Encoding Initiative

XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web and elsewhere.

XML: what it is and why you should care

XML terminology 1

An XML document may contain:

XML terminology 2

The rules of the XML Game

Representing an XML tree

Parts of a real XML document

<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
    <hello type="enthusiastic">hello world!</hello>
</greetings>
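
As a rough illustration of how a parser sees these parts, the following sketch reads the document above with Python's standard `xml.etree.ElementTree` module. The variable names are illustrative only; any conforming XML parser would report the same element names, attribute, and text.

```python
import xml.etree.ElementTree as ET

# The greetings document from the slide, as one string.
DOC = (
    '<?xml version="1.0"?>\n'
    '<greetings xmlns="http://www.example.org/greetings">\n'
    '    <hello type="enthusiastic">hello world!</hello>\n'
    '</greetings>\n'
)

root = ET.fromstring(DOC)   # the root element <greetings>
hello = root[0]             # its first (and only) child element <hello>

print(root.tag)      # element name, qualified by the namespace URI in braces
print(hello.tag)     # child elements inherit the default namespace
print(hello.attrib)  # attributes as a dictionary
print(hello.text)    # character data inside the element
```

Note that ElementTree writes a namespaced name as `{namespace-uri}local-name`; this is a convention of that library, not part of XML itself.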

The XML declaration

An XML document should begin with an XML declaration (strictly speaking it is optional in XML 1.0, but recommended), which does three things:

Example:

<?xml version="1.0" ?>
<?xml version="1.0" encoding="iso-8859-1" ?>
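
Why the encoding part of the declaration matters can be sketched with a small Python experiment using the standard `xml.etree.ElementTree` module. The sample text and variable names are assumptions made for illustration; the point is only that the same bytes are read correctly with the declaration and rejected without it.

```python
import xml.etree.ElementTree as ET

# "\u00e9" (e with acute accent) is the single byte 0xE9 in ISO-8859-1,
# but would be two bytes in UTF-8.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

# With the declaration, the parser knows how to decode the byte.
declared_text = ET.fromstring(latin1_doc).text
print(declared_text)

# Without a declaration the parser assumes UTF-8 and rejects the same byte.
undeclared_doc = "<p>caf\u00e9</p>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared_doc)
    accepted_without_declaration = True
except ET.ParseError:
    accepted_without_declaration = False
print(accepted_without_declaration)
```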

Declaring namespaces

All TEI documents are declared within the TEI namespace — a way of distinguishing one set of elements from another with the same names (like <p>):

<TEI xmlns="http://www.tei-c.org/ns/1.0"> ... </TEI>

XML documents can include elements declared in different namespaces.

Example:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:math="http://www.w3.org/1998/Math/MathML">
<p>...
    <math:math>...</math:math>
    ...</p>
</TEI>

The xml namespace is used by the TEI for the global attributes @xml:id and @xml:lang.
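
A minimal sketch of what namespaces do in practice, again with Python's standard `xml.etree.ElementTree`: two elements that share the local name `p` remain distinct because they belong to different namespaces, and `xml:lang` and `xml:id` live in the reserved xml namespace. The second namespace URI (`http://www.example.org/other`) and the prefix `o` are hypothetical and used only for this illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # reserved for the "xml" prefix
OTHER_NS = "http://www.example.org/other"        # hypothetical second vocabulary

doc = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:o="http://www.example.org/other">'
    '<p xml:id="p1" xml:lang="en">text <o:p>other</o:p></p>'
    '</TEI>'
)
root = ET.fromstring(doc)

# ElementTree addresses elements and attributes as {namespace-uri}local-name.
tei_p = root.find(f"{{{TEI_NS}}}p")
other_p = tei_p.find(f"{{{OTHER_NS}}}p")

print(tei_p.tag)                        # the TEI <p>
print(other_p.tag)                      # a different <p>, despite the same local name
print(tei_p.get(f"{{{XML_NS}}}lang"))   # value of @xml:lang
print(tei_p.get(f"{{{XML_NS}}}id"))     # value of @xml:id
```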

Example: Kawkab America #55, 28 April 1893

<?xml version="1.0" encoding="UTF-8"?>
<div type="article" xml:lang="en">
    <head xml:lang="ar">الشرق في معرض <placeName>شيكاغو</placeName></head>
    <head xml:lang="en">The orient at fair.</head>
    <p>Is there anybody left in <placeName>Syria</placeName>, <placeName>Egypt</placeName>,
        <lb/><placeName>Turkey</placeName>, <placeName>Morocco</placeName>, and the other countries
        <lb/>of the Orient? Were the questions asked
        <lb/>by officers at <placeName>Ellis Island</placeName> and the Orient
        <lb/>als of <placeName>New York</placeName> within the last few
        <lb/>weeks. The long expected concessioners, 
        <lb/>exhibitors and participants in the <orgName>World's
        <lb/>Fair</orgName>, who for many days and weeks have
        <lb/>been directing their footsteps from the
        <lb/>various lands of the rising sun towards
        <lb/>the "<q>new land of promise</q>" have arrived
        <lb/>in large numbers, and set foot upon the
        <lb/>soil of the new world which they have 
        <lb/>sought with feelings of high expectation,
        <lb/>and an eagerness to which long distance
        <lb/>had added many charms. The Sheikh
        <lb/>who from childhood hours had learned to
        <lb/>praise Allah for every blessing of life,
        <lb/>must have shouted a hearty "<quote>Alhamduli
        <lb/>la! and Allah<gap/> Kariem!</quote>" when after a
        <lb/>journey of some weeks and months by
        <lb/>land and by sea he saw in <placeName>New York</placeName> har
        <lb/>bor the majestic form of the Goddess of
        <lb/>Liberty with the beacon of light in her
        <lb/>outstretched hand bidding him welcome
        <lb/>to the "<q>home of the brave and the land of
        <lb/>the free.</q>"</p>
</div>

Example deconstructed: root node

<?xml version="1.0" encoding="UTF-8"?>
<div type="article" xml:lang="en">
    <!-- ... -->
</div>

Example deconstructed: head

<head xml:lang="ar">الشرق في معرض <placeName>شيكاغو</placeName></head>
<head xml:lang="en">The orient at fair.</head>

Example deconstructed: paragraph, quotes, and named entities

<p>Is there anybody left in <placeName>Syria</placeName>, <placeName>Egypt</placeName>,
<lb/><placeName>Turkey</placeName>, <placeName>Morocco</placeName>, and the other countries
<lb/>of the Orient? Were the questions asked
<lb/>by officers at <placeName>Ellis Island</placeName> and the Orient
<lb/>als of <placeName>New York</placeName> within the last few
<lb/>weeks. The long expected concessioners, 
<lb/>exhibitors and participants in the <orgName>World's
<lb/>Fair</orgName>, who for many days and weeks have
<lb/>been directing their footsteps from the
    <!-- ... -->
<lb/>had added many charms. The Sheikh
<lb/>who from childhood hours had learned to
<lb/>praise Allah for every blessing of life,
<lb/>must have shouted a hearty "<quote>Alhamduli
<lb/>la! and Allah<gap/> Kariem!</quote>" when after a
    <!-- ... -->
<lb/>outstretched hand bidding him welcome
<lb/>to the "<q>home of the brave and the land of
<lb/>the free.</q>"</p>
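
One payoff of this kind of markup is that the encoded features can be queried. The sketch below, which assumes Python's standard `xml.etree.ElementTree` and a shortened excerpt of the article, lists the place names in the paragraph, the language of each heading, and the number of recorded line beginnings. Like the slide example, the excerpt declares no namespace, so plain element names suffice; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened excerpt of the article shown above.
fragment = """<div type="article" xml:lang="en">
    <head xml:lang="ar">الشرق في معرض <placeName>شيكاغو</placeName></head>
    <head xml:lang="en">The orient at fair.</head>
    <p>Is there anybody left in <placeName>Syria</placeName>, <placeName>Egypt</placeName>,
        <lb/><placeName>Turkey</placeName>, <placeName>Morocco</placeName>, and the other countries
        <lb/>of the Orient?</p>
</div>"""

div = ET.fromstring(fragment)

# Place names mentioned in the paragraph (the headings are not included).
places = [place.text for place in div.findall("./p/placeName")]
# Language of each heading, read from @xml:lang.
head_langs = [head.get(XML_LANG) for head in div.findall("head")]
# Number of line beginnings recorded in the paragraph.
line_breaks = len(div.findall("./p/lb"))

print(places)
print(head_langs)
print(line_breaks)
```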

XML syntax: the small print

What does it mean to be well-formed?

  1. There is a single root node containing the whole of an XML document
  2. Each subtree is properly nested within the root node
  3. Element/attribute/etc. names are always case sensitive
  4. Start-tags and end-tags are always mandatory (except where an empty element uses a combined start-and-end tag, e.g. <pb/>)
  5. Attribute values are always quoted

A file can be valid in addition to being well-formed. This means that it obeys the rules of a specified schema, such as a TEI schema.
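
The well-formedness rules can be tried out with any XML parser. The following sketch uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the sample strings are made up for this illustration. It checks well-formedness only: ElementTree does not validate against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One sample per rule; element names are arbitrary.
samples = [
    "<a><b/></a>",      # single root, properly nested
    "<a/><b/>",         # rule 1 broken: two root elements
    "<a><b></a></b>",   # rule 2 broken: overlapping elements
    "<a></A>",          # rule 3 broken: names differ in case
    "<a><b></a>",       # rule 4 broken: missing end-tag for <b>
    "<a n=1/>",         # rule 5 broken: unquoted attribute value
    "<a n='1'/>",       # single quotes are as good as double quotes
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print(f"{sample:<18} {'well-formed' if ok else 'NOT well-formed'}")
```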

Test your XML knowledge

Which are correct?

<seg>some text</seg>
<seg> <foo>some</foo> <bar>text</bar> </seg>
<seg> <foo>some <bar></foo> text</bar> </seg>
<seg type="text">some text</seg>
<seg type='text'>some text</seg>
<seg type=text>some text</seg>
<seg type="text"> some text <seg/>
<seg type="text"> some text<gap/> </seg>
<seg type="text">some text</Seg>

XML is an international standard

(The @xml:id attribute is another W3C-defined attribute.)

The TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts chiefly in the humanities, social sciences and linguistics.

1987 was a long time ago…

The Text Encoding Initiative was born into a very different world

…but also familiar problems

The birth of the Text Encoding Initiative

TEI is old!

Why the TEI

The TEI provides

Relevance

Why would you want those things?

The scope of intelligent markup

Even within the original scope of the TEI we have

Reasons for attempting to define a common framework

The TEI was designed to support multiple views of the same resource. The TEI is an evolving model of the concerns of Digital Humanities.

TEI adopted XML

In 2002, the TEI Consortium published the P4 Guidelines, which were essentially an adaptation of P3 to XML, which had been finalised as a W3C Recommendation in 1998.

P5, a complete overhaul of the Guidelines, was first published in November 2007. Updates have been published regularly, every few months, ever since. The current version, 2.8.0, was released on 6 April 2015.

The Guidelines are currently maintained as an open source project on SourceForge at http://tei.sf.net/, from which released and development versions may be freely downloaded.

TEI XML

Note: namespaces vs schemas

Conformance issues

A document is TEI Conformant if and only if it:

or if it can be transformed automatically using some TEI-defined procedures into such a document (it is then considered TEI-conformable).

A final note on standardization

Standardization should not mean “Do what I do”, but rather “Explain what you do in terms I can understand”.

Instead of an abstract set of rules and norms, standardization should be thought of as a community of practice.