SAMPLE DOCUMENTS README

SAMPLE DOCUMENTS

DURITO PROJECT

2 June, 2001

MARKUP SCHEME: GENERAL NOTES

The interview transcription uses XML markup to indicate some of its structural features. The markup scheme is an implementation of the standard proposed by the Text Encoding Initiative (TEI). For a full description of the tags used, please refer to the TEI guidelines, available at http://www.tei-c.org/.

TEI proposes an extremely large number of tags; the encoder chooses from among these on the basis of which features of the text he or she would like to mark up. Only a few tags are absolutely required in order to produce TEI-conformant documents. The TEI scheme is also extensible: it establishes specific mechanisms for adding new tags, modifying existing tags, and documenting these changes.

The interview PHO-Z/1/68 is one of a large collection of interviews that are being marked up using this implementation of the TEI standard. Said implementation has been designed by the "Testimonios Zapatistas" project (TZ) specifically for the purposes of this collection. Its basic goal is to express, in structured, machine-readable terms, much of the same information that transcribers normally include in non-marked-up transcriptions. Thus, markup is used to indicate which words were spoken by whom, what tape recordings were used as sources for different parts of the transcription, which passages are not clearly audible, etc. Also, a certain amount of metadata is contained in the <teiHeader> element, obligatory according to the TEI standard.

In addition to the standard tags used in the header, the TZ implementation uses 9 TEI-defined tags in the body of the transcription. Two new tags have been added to the TEI tagset, and a non-standard attribute has been added to one TEI-defined tag. Most tags are used in a more restricted manner than is formally allowed by TEI. For example, we've severely limited the points in the XML structure where certain tags may occur, and have forbidden the use of many optional attributes. Such simplifications make life easier for transcribers and programmers alike, and are possible due to the relatively uniform nature of the source material.

We've not yet put together or obtained a DTD or a Schema for this markup scheme. At the end of this text, you'll find an approximation of a formal description of the conventions used; but it's not a compliant DTD.

The markup may seem an odd mix of Spanish and English. Basically, we've kept tag and attribute names in English, but text and most attribute values are in Spanish. Stated differently: the markup scheme is in English, and the content is in Spanish.

This scheme is only a first layer of encoding. Other more interpretative information--such as subject tags, place and person name indicators--will be added at later stages. For the time being, such tags are only employed in the <teiHeader>.

A NOTE ON METADATA

As mentioned, the TEI standard requires that a significant amount of metadata be included in the <teiHeader>. However, according to Durito's design, at least some of this information should be stored using RDF. In order to implement this digital collection, we should decide how to deal with this repetition of information. Some RDF data might be generated automatically from the XML files.

This problem illustrates how TEI concretizes a set of syntaxes (DTD's) more than a set of concepts (which is what RDF Schema helps you represent). Certainly, the concepts are there, but you have to read the guidelines to get to them; they're not machine-understandable. Perhaps future efforts in the area of text encoding should concentrate on generating machine-understandable representations of networks of concepts, and on standardized ways of describing how those concepts can be expressed, that is, turned into syntax.
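
To make the repetition discussed above more concrete, here is a rough sketch of how two header values might be restated as RDF/XML. The choice of Dublin Core properties, the example.org URI, and the placeholder title are assumptions made only for this illustration; Durito's actual RDF vocabulary is not defined in this document.

```xml
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <!-- Hypothetical URI standing for the interview -->
  <rdf:Description rdf:about='http://example.org/tz/PHO-Z-1-68'>
    <!-- Would repeat the content of <title type='principal'>; placeholder text -->
    <dc:title>Título principal de la entrevista</dc:title>
    <!-- Would repeat the content of <title type='clasificación'> -->
    <dc:identifier>PHO-Z/1/68</dc:identifier>
  </rdf:Description>
</rdf:RDF>
```

Both values would already be present in the <teiHeader>, which is exactly the duplication at issue.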

TAGS EMPLOYED

Here is an explanation of the tags used in this excerpt, as implemented by TZ, in order of appearance:

<TEI.2>

This is the root element of TEI documents.

<teiHeader type='text'>

This tag contains the required header section, where metadata is recorded. type='text' means that this metadata refers to a single text and not a collection. All of the interviews TZ encodes have this value for the type attribute of the <teiHeader>.

<fileDesc>

A container for the most basic metadata.

<titleStmt>

Holds the text's titles and the main list of individuals responsible for its creation.

<title type='type'>

A title.

Within the <titleStmt>, this element occurs twice: once with type set to "principal", and once with the type set to "clasificación". The first occurrence contains the text's main title. The second holds a secondary title, which is the interview's call number within our collection.

Within <seriesStmt>, <title> holds the title of the series that this text is a part of. Here, it has no attributes.
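
As a sketch, the two titles inside <titleStmt> might look as follows. The main title is placeholder text, and the call number is the one given above for the sample interview.

```xml
<titleStmt>
  <!-- Main title (placeholder text) -->
  <title type='principal'>Título principal de la entrevista</title>
  <!-- Secondary title: the call number within the collection -->
  <title type='clasificación'>PHO-Z/1/68</title>
  <!-- <respStmt> elements omitted from this sketch -->
</titleStmt>
```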

<persName>

The name of a person. For the time being, TZ only uses this element in the <teiHeader>.

Only in one location does <persName> have a type attribute: when it's contained within a <person> element. In that place, it may have type set to "iniciales", in which case it contains the initials of the person being referred to. (See below.)

<date value='date'>

A date. value is an ISO-format date. For the time being, TZ only uses this element in the <teiHeader>.

<placeName>

A place name. For the time being, TZ only uses this element in the <teiHeader>.

<respStmt>

List of people and entities responsible for some aspect of the text's creation. When this element appears within <titleStmt> it holds the main list of those responsible for the intellectual content of the text.

When <respStmt> is contained within <editionStmt>, it provides a secondary set of credits associated with this specific edition of the text.

<resp>

A role in the creation of the text, for which responsibility is ascribed in the following <name> element.

<name>

The name of the person or entity responsible for whatever role was described in the previous <resp> element.
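
A minimal sketch of how <resp> and <name> pair up inside a <respStmt>. Both values are placeholders, not taken from any interview.

```xml
<respStmt>
  <!-- The role comes first, then the party who fills it -->
  <resp>Transcripción</resp>
  <name>Nombre del transcriptor</name>
</respStmt>
```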

<orgName>

The name of an organization. For the time being, TZ only uses this element in the <teiHeader>.

<editionStmt>

Information on this specific edition of the text.

<edition>

Name of this specific edition of the text.

<publicationStmt>

Ascribes responsibility for the publication of the text, and defines the terms of its distribution.

<authority>

Describes the person or entity by whose authority the text has been published.

<address>

The address of the publisher. Tentatively, we've placed in the second <address> element an entity that will hold the publisher's Internet address (as yet undefined).

<availability>

The terms under which the text is made available.

<p>

A paragraph, or a blurb of text of any length. Many TEI elements don't take straight text (PCDATA) and require that plain prose be wrapped in one of these.

<seriesStmt>

A description of the series that this text is a part of.

<sourceDesc>

Information on the material that was used as a source for this electronic text.

<bibl>

A bibliographic reference. TZ only uses it within the <sourceDesc>. It holds the interview's previous call number, now in disuse.

<recordingStmt>

A set of descriptions of recordings.

<recording id='id' type='audio' dur='duration'>

A description of a recording. The id attribute provides a unique identifier for each <recording> element. The identifier is then used by other tags to refer back to this element. The type and dur attributes are just what they appear to be. The value of type does not vary within our collection.

Most interviews were recorded on more than one segment of magnetic tape. There is always one <recording> tag for every segment of tape of the interview in question.
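
For an interview recorded on two segments of tape, the <recordingStmt> might be sketched like this. The id values and the ISO-8601-style durations are assumptions for illustration; the content of each <recording> is not described here, so the elements are left empty.

```xml
<recordingStmt>
  <!-- One <recording> per segment of tape; ids and durations are hypothetical -->
  <recording id='g1' type='audio' dur='PT45M'/>
  <recording id='g2' type='audio' dur='PT30M'/>
</recordingStmt>
```

A <source decls='g1'/> tag in the body would then refer back to the first of these.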

<profileDesc>

A container for additional metadata that is considered "non-bibliographic".

<langUsage>

A list of languages employed in the text.

<language id='id'>

A description of a language. Again, the id attribute here assigns this element a unique identifier that other elements can refer back to.

<particDesc>

A list of participants in the interview.

<person id='id' role='role'>

A person who participated in the interview. role describes his or her role, and id is used by subsequent elements to refer back to this person.
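
A sketch of a <particDesc> with two participants. The identifiers are the ones used in the <shift> example further on; the role values, the names, the initials, and the pairing of a full name with a set of initials are all assumptions made for this illustration.

```xml
<particDesc>
  <person id='e0032' role='entrevistado'>
    <!-- Placeholder name and initials -->
    <persName>Nombre del entrevistado</persName>
    <persName type='iniciales'>N. E.</persName>
  </person>
  <person id='i0029' role='entrevistador'>
    <persName>Nombre del entrevistador</persName>
    <persName type='iniciales'>N. E.</persName>
  </person>
</particDesc>
```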

<settingDesc id='id'>

This is used to refer to a setting in which the interview was conducted. If the interview spanned several days, then a different <settingDesc> is used for each session. id allows other elements to refer back to this one.

<textClass>

A container for classificatory identifiers of this text.

<classCode scheme='Testimonios Zapatistas'>

Here we repeat the text's call number within the TZ collection. scheme will always have "Testimonios Zapatistas" as its value.

<text>

A TEI-imposed container for the main part of the text.

<body>

Another TEI-imposed container for the main part of the text.

<div type='sesión' decls='idref'>

Indicates the broadest divisions within the interview, namely "sessions", that is, occasions on which the interviewer(s) recorded the interviewee. Many interviews consist of only one such division. type is always "sesión", or session. decls takes the same value as the id attribute of the corresponding <settingDesc>.

Normally, <div> does not have a decls attribute; I have given it one.

<source decls='idref'/>

This tag indicates the source of the text that follows it. Most of the time, the text has been transcribed from an audio recording; in those cases, decls takes the same value as the id attribute of the corresponding <recording>.

There are a few interviews for which part of the audio recording has been lost, but a complete typewriter transcription survives. In those situations, we have transcribed the missing sections from the typewriter copy; <source> then links to a description of that document.

This tag is not defined by TEI.

<u who='idref' n='number'>

The standard TEI utterance tag. It contains words spoken by one of the participants in the interview. The who attribute takes the same value as the id attribute of the <person> that refers to the person who spoke those words. n is used simply to number the utterances.
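
Putting the session, source, and utterance tags together, the start of a session might be sketched as follows. The identifiers 's1' and 'g1' are hypothetical and would have to match the id of a <settingDesc> and of a <recording> in the header; the spoken text is placeholder prose.

```xml
<div type='sesión' decls='s1'>
  <!-- The utterances that follow were transcribed from recording g1 -->
  <source decls='g1'/>
  <u who='i0029' n='1'>Texto de la pregunta.</u>
  <u who='e0032' n='2'>Texto de la respuesta.</u>
</div>
```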

<xptr type='tiempo correspondiente en el audio' doc='reference to MP3 file' from='HyQ time in milliseconds'/>

Links points in the text to corresponding times in the improved and compressed version of the audio recording. It appears at semi-regular intervals throughout the text. type is always "tiempo correspondiente en el audio". doc must refer to the MP3 file being synchronized. The value of from always begins with HyQ and is followed by corresponding time in the recording, expressed in milliseconds.

Note: the syntax used in the attribute values of this tag will likely change.
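
With that caveat in mind, a synchronization point 83.5 seconds (83500 milliseconds) into a recording might be sketched as below. The doc value and the space separating HyQ from the number are guesses made for this illustration, not a statement of the final syntax.

```xml
<u who='e0032' n='2'>Primeras palabras
  <!-- 83500 ms = 83.5 s into the MP3 referred to by doc (hypothetical value) -->
  <xptr type='tiempo correspondiente en el audio' doc='grabacion1' from='HyQ 83500'/>
  palabras que siguen.</u>
```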

<unclear>

A container for words that were not spoken or recorded clearly. Transcribers were uncertain as to the accuracy of these parts of the transcription. The tag is used without attributes.

<orig reg='regularized word form or definition' n='reference key in the TZ database'>

Holds words that are not part of standard Mexican Spanish. reg contains a standard form or definition of the word. n is a key that identifies that word in our database.
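
A sketch of <unclear> and <orig> inside an utterance. The colloquial form "pos" for "pues" is chosen only as an illustration, and the key 'k0001' is hypothetical.

```xml
<u who='e0032' n='3'>Texto claro, <unclear>palabras dudosas</unclear>,
  <orig reg='pues' n='k0001'>pos</orig> más texto.</u>
```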

Other tags are used throughout the collection of interviews, but do not appear in this fragment. They are:

<shift feature='voz' new='distinct vocal feature'/>

Indicates a shift in the speaker's tone of voice. This tag applies to whatever text follows it. feature is always "voz", and new describes the new characteristic of the voice. Possible values for new include "cantando" (singing), "llorando" (crying), "gritando" (shouting), and "en tono normal" (in a normal tone). This is an empty tag; the feature it describes continues to apply to the expressions of the person within whose <u> tag it's contained until another <shift> is encountered in a <u> of that same person.

For example, in the following passage, the person referred to by the identifier "e0032" began crying after the third word of utterance number 200, and stopped crying before utterance number 204. Nothing is said about the tone of voice of person "i0029".

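<u who='e0032' n='200'>Ya sabemos que <shift feature='voz' new='llorando'/> Zapata murió, que lo asesinaron, los canijos.</u>
<u who='i0029' n='201'>Bueno, pues...</u>
<u who='e0032' n='202'>¡Qué triste fue! ¡Triste, triste, triste!</u>
<u who='i0029' n='203'>Sí, sí.</u>
<u who='e0032' n='204'><shift feature='voz' new='en tono normal'/> Pero ya ¿qué podemos hacer?</u>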

<vocal desc='description of the non-lexical vocal sound'>

Indicates a non-lexical vocal sound, such as a sneeze or a cough, produced by the person within whose <u> it is contained.

<gap reason='no se entiende'/>

Means that some amount of vocal expression is incomprehensible and thus has not been transcribed. reason is always "no se entiende".
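
A sketch of both empty tags inside an utterance. The desc value 'tos' (a cough) is an assumed example; the surrounding words are placeholder text.

```xml
<u who='e0032' n='4'>Texto inicial <vocal desc='tos'/> más texto
  <gap reason='no se entiende'/> texto final.</u>
```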

<recordingInterrupt/>

Indicates a point where sounds on the magnetic tape seem to indicate that the tape recorder was stopped and then re-started. It is an empty element, has no attributes, and is not defined by TEI.

<note>

Contains an annotation that cannot be indicated using any of the other tags.

CONTENT MODEL

The following code uses the syntax of XML DTD's to describe the tags that TZ employs in the body of interviews. It is not a complete DTD; it could be a fragment of one. Any document that conforms to the TZ encoding scheme should respect the constraints expressed below. These constraints are much narrower than those enforced by standard TEI DTD's. I do not mean to suggest that the following text be integrated into a DTD that we would finally deploy.

<!ELEMENT text (body)>
<!ELEMENT body (div+)>
<!ELEMENT div (note*, source, note*, u, (u | source | note
| recordingInterrupt)* )>
<!ELEMENT note (#PCDATA)>
<!ELEMENT source EMPTY>
<!ELEMENT u (#PCDATA | note | orig | shift | vocal | gap |
source | unclear | xptr)*>
<!ELEMENT recordingInterrupt EMPTY>
<!ELEMENT orig (#PCDATA)>
<!ELEMENT shift EMPTY>
<!ELEMENT vocal EMPTY>
<!ELEMENT gap EMPTY>
<!ELEMENT unclear (#PCDATA | note | orig | shift | vocal)*>
<!ELEMENT xptr EMPTY>
<!ATTLIST text >

<!ATTLIST body >
<!ATTLIST div
            type             CDATA          #FIXED           "sesión"
            decls            IDREF          #REQUIRED >
<!ATTLIST note >
<!ATTLIST source
            decls            IDREF          #REQUIRED >
<!ATTLIST u
            who              IDREF          #REQUIRED
            n                CDATA          #REQUIRED >
<!ATTLIST recordingInterrupt >
<!ATTLIST orig
            reg              CDATA            #REQUIRED
            n                CDATA            #REQUIRED >
<!ATTLIST shift
            feature          CDATA            #FIXED         "voz"
            new              CDATA            #REQUIRED >
<!ATTLIST vocal
            desc             CDATA            #REQUIRED >
<!ATTLIST gap
            reason           CDATA            #FIXED         "no se entiende" >
<!ATTLIST unclear >
<!ATTLIST xptr
            type             CDATA            #FIXED         "tiempo correspondiente en el audio"
            doc              CDATA            #REQUIRED
            from             CDATA            #REQUIRED >
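
As a check on the content model, here is a small sketch of a <text> that should satisfy the declarations above. All identifiers ('s1', 'g1', 'g2', 'i0029', 'e0032') are assumed to be declared by id attributes in the <teiHeader>, which is not shown, and all spoken text is placeholder prose.

```xml
<text>
  <body>
    <div type='sesión' decls='s1'>
      <!-- A source must come before the first utterance -->
      <source decls='g1'/>
      <note>Anotación del transcriptor.</note>
      <u who='i0029' n='1'>Texto de la pregunta.</u>
      <u who='e0032' n='2'>Texto de la respuesta <gap reason='no se entiende'/> y más texto.</u>
      <!-- The tape recorder appears to have been stopped and re-started here -->
      <recordingInterrupt/>
      <u who='i0029' n='3'>Otra pregunta.</u>
      <!-- The transcription continues from a second segment of tape -->
      <source decls='g2'/>
      <u who='e0032' n='4'>Otra respuesta.</u>
    </div>
  </body>
</text>
```

The children of <div> follow the declared order: a <source>, an optional <note>, at least one <u>, and then any mix of <u>, <source>, <note>, and <recordingInterrupt>.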