Testimonios Zapatistas - Sample Text Document

*****************************************

Durito

Documents foUnd, Reproduced and studIed using a free software
applicaTiOn
Documentos Ubicados, Reproducidos e Investigados a Través de un
sistema de sOftware libre

Mock-up/prototype documentation
7 November, 2001


*****************************************

Contents:

1. Introduction

2. Installation

3. How it works
	3.1 Overall framework and global stuff
	3.2 Text processing model
	3.3 Requests and instances
	3.4 URL mechanism
	3.5 Front ends and Netscape 4.x
	3.6 Audio front end: MADPLAY mock-up
	3.7 Analysis engine: MiniSearch mock-up
	
4. Summary: issues and plans

5. Change log

6. License


*****************************************

1. Introduction

Durito is a free software application that aims to provide a
flexible framework for publishing and analysing digital archives. 
It is still in the early stages of its development; for a detailed
explanation of initial design goals, please see
http://durito.sourceforge.net.

The present version is somewhere between a "prototype" and a
"mock-up": much of the code will likely be replaced in upcoming
versions.  This mock-up has served two purposes: (1) to define and
test algorithms, and (2) to provide a demonstration version of the
Testimonios Zapatistas (TZ) digital archive.

The current codebase only works with the TZ archive, though it
employs many mechanisms that are intended, in the long run, to be
generalizable.  It is not yet suitable for deployment on a network,
but rather is intended for local use on a single machine.

Comments on any aspect of this project are more than welcome, and
should be directed to our mailing list
(durito-devl@lists.sourceforge.net) or the author
(andrew_g11@users.sourceforge.net).  Many thanks in advance.


*****************************************

2. Installation

Durito requires various programs and components; many of them have
been bundled in the Durito_components release.  They are:

A) Win32 (tested under Win98SE and WinNT; GNU/Linux version should
be very easy to set up)
B) Perl (tested with ActiveState's distribution of Perl 5.6.1; may
also work with earlier versions)
C) Sablotron (an XSLT engine)
D) New version of Expat used by Sablotron (Expat is an XML parser)
E) The following non-standard Perl modules: Win32::API, Win32::DDE
and XML::Sablotron
F) Other Perl modules included with the standard ActiveState
distribution: FindBin, XML::Simple, Win32::Process,
Win32::TieRegistry, IO::Socket, IO::Select, Socket, Tie::RefHash,
POSIX, LWP::MediaTypes, HTTP::Date, URI::Escape.
G) Netscape Communicator 4.x (Durito is not yet set up to run with
other browsers)
H) MADPLAY (an MPEG player)
I) A few Cygwin components: cygwin1.dll, ps.exe and kill.exe

All of the above except H) and I) should be installed in a standard
manner or as per the instructions included with the given component. 
MADPLAY and the Cygwin bits should be copied into a single
directory.

The files in the recent_code package (which includes this text)
should automatically unzip into an appropriate directory structure. 
The durito.conf file should be edited to accurately indicate the
paths of this structure.  Thus, the line

C:\durito\tests\durito\durito\HtDocs

should be edited to indicate the actual path of the "HtDocs"
directory.  Relative paths are O.K.; "." refers to the current
directory for durito.pl. (Thus ".\HtDocs" should, in most cases, be
a fine way to indicate the "HtDocs" directory.)

The tag  should contain the path of the directory in
which MADPLAY and the Cygwin components reside.

For audio functionality, please download the sample_audio_documents
package, and copy its contents to the directory indicated by the
 tag.

You should then be ready to run Durito; just switch to the directory
containing durito.pl and type "perl durito.pl" from the command
prompt.

It is also possible to use PerlApp (from ActiveState) to create an
executable version that can be distributed on CD-ROM.  The file
pa_make.bat can be used to run PerlApp with the appropriate
command-line arguments.  The resulting durito.exe file should be run
with "PerlApp" as its first argument.

The executable version of Durito can be run directly off a CD-ROM. 
(At least, this is the case for Win98 and WinNT; there may be some
problems under Win95.)  The only requirement, in all cases, is that
Netscape 4.x be installed on the target machine.

For the CD-ROM-based version, durito.conf should contain only
relative paths.  A few DLLs--namely, those used by Sablotron and
Expat--should be included on the CD-ROM, in the same directory as
the Durito executable.

There is one patch to be applied to an external module in order for
the PerlApp version to run correctly.  In XML::Simple, on line 227,
the call to the "new" method must be changed to use the direct
("->new()") syntax.
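
For readers unfamiliar with the two call styles, here is a minimal
sketch of the difference.  Demo::Parser is a hypothetical stand-in
class; it is not the code found in XML::Simple, whose exact line is
not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; not part of XML::Simple or Durito.
package Demo::Parser;

sub new {
    my ($class, %opts) = @_;
    return bless { %opts }, $class;
}

package main;

# Indirect object syntax: Perl has to work out that "new" is a method.
my $indirect = new Demo::Parser(style => 'tree');

# Direct syntax: the method call is unambiguous.
my $direct = Demo::Parser->new(style => 'tree');

print ref($indirect), " ", ref($direct), "\n";
```

Both calls build the same kind of object; only the second form is the
one the patch asks for.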

The TZ splash screen uses a Flash movie.  (This non-free element
will eventually be removed.)  To ensure correct handling of this
file, I've added the following line to the media.types file of the
LWP Perl module:

application/x-shockwave-flash	swf


*****************************************

3. How it works
	
>>>> 3.1 Overall framework and global stuff

Durito's long-term goal is to become a flexible, general-purpose
publishing and analysis tool.  Of course, that objective is still
far off; however, flexibility and generalizability of mechanisms are
characteristics that Durito must strive for, even at this early
stage.

The main areas of code are:

-- Durito::Global --  Contains all global constants and variables,
as well as functions that read in, set and provide configuration
information.

-- Durito::ANServer --  A small, incomplete HTTP server just for
Durito.

-- Durito::RP and Durito::RP::Processor --  Modules that handle
different Durito "instances" (windows, essentially) and pass user
requests to the appropriate parts of the program.

-- Durito::Browser, Durito::Netscape4, Durito::DefaultBrowser and
Netscape4_ext.pl --  Code that performs activities related to the
browser front-end.

-- Durito::Audio and Durito::MAD --  Deal with the "mocked-up" audio
front-end.

-- Durito::Analysis and Durito::Minisearch --  Code for the
"mocked-up" search functionality.

-- durito.pl --  The program's entry point.


The first action of durito.pl is to read in the configuration file
(durito.conf).  It then sets two handlers (to be called later by the
server module) and starts the server.

The first handler, &main::complete_startup, is called once the
server has been successfully started.  It completes all
initialization processes--namely, it initializes various modules and
starts a browser window pointing to the appropriate start page on
the server.

The second handler, &main::request_handler, is called by the server
every time a URL containing a query field is requested.  It acts
essentially as a "go-between" between the server and the modules
that actually process requests.  In other words, it receives data on
the query fields contained in the URL, passes them on to the request
object (see below), and then receives the HTTP response that is to
be sent back to the server.

The server will shut down when it receives a signal to do so.  (This
signal is set at the appropriate time by the modules that process
requests.)  The program then exits.
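
The callback flow described above can be sketched roughly as follows. 
This is only an illustration: Sketch::Server is a hypothetical
stand-in for Durito::ANServer that uses no sockets, and the
"shutdown" command name is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the server: no networking, only callbacks.
package Sketch::Server;

sub new {
    my ($class, %args) = @_;
    my $self = {
        on_start   => $args{on_start},
        on_request => $args{on_request},
        shutdown   => 0,
    };
    return bless $self, $class;
}

sub request_shutdown { $_[0]{shutdown} = 1 }

# Feed a list of URLs through the server as if they had arrived by HTTP.
sub run {
    my ($self, @urls) = @_;
    my @responses;
    $self->{on_start}->($self);
    for my $url (@urls) {
        last if $self->{shutdown};
        my ($query) = $url =~ /\?(.*)\z/s;
        next unless defined $query;    # only URLs with a query reach the handler
        push @responses, $self->{on_request}->($self, $query);
    }
    return @responses;
}

package main;

my @log;
sub complete_startup { push @log, 'started' }

sub request_handler {
    my ($server, $query) = @_;
    # Assumed command name, used here only to trigger the shutdown signal.
    $server->request_shutdown if $query =~ /cmd\d+=shutdown/;
    return "handled: $query";
}

my $server = Sketch::Server->new(
    on_start   => \&complete_startup,
    on_request => \&request_handler,
);
my @responses = $server->run(
    'http://127.0.0.1:1910/index.html',
    'http://127.0.0.1:1910/NonDoc.html?inst_index=1&cmd1=html_construct(base_frameset)',
    'http://127.0.0.1:1910/NonDoc.html?inst_index=1&cmd1=shutdown',
    'http://127.0.0.1:1910/NonDoc.html?inst_index=1&cmd1=never_reached',
);
print "$_\n" for @log, @responses;
```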

Global stuff contained in Durito::Global includes constants used to
refer to configuration data (prefaced with "DU_C_") and global
variables that are blessed into objects that provide APIs for
various components of the program.  (Global variables are prefaced
with "DU_G_".)

The mechanism for initializing different areas (front end, analysis,
audio, etc.) was designed with flexibility in mind, but it may be
more complicated than necessary.  It provides one degree of
separation between each area and the rest of the program, and allows
the configuration file to be used to determine some of the modules
that are actually instantiated.

Let us take, for example, the case of the Audio front end.  In
durito.pl, it is instantiated by the command

$DU_G_audio_front_end = Durito::Audio->new();

That method of Durito::Audio actually looks at data from the
configuration file to determine the name of the audio module to be
employed, loads the module, then calls the "new" method of the class
of the same name, and returns a reference that has been blessed into
an instance of that class.  Thus, later on, any methods called on
the $DU_G_audio_front_end object are sent to whatever package/class
is indicated in the configuration file, in the 
field.  For the time being, the only audio front end module
implemented is Durito::MAD; but others could be implemented, and if
they stick to the same API, the rest of the program should not have
to be modified in order to interact with them correctly.  A simple
change to the configuration file would be sufficient to substitute
one module for another.
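
A minimal sketch of this kind of configuration-driven constructor is
given below.  The package names (Sketch::Audio, Sketch::MAD) and the
configuration key "audio_front_end" are assumptions made for the
example; they are not the names used in the Durito code or in
durito.conf.

```perl
use strict;
use warnings;

# Hypothetical back end, defined inline so the sketch is self-contained.
package Sketch::MAD;

sub new  { my ($class) = @_; return bless {}, $class }
sub play { my ($self, $file) = @_; return "MAD playing $file" }

package Sketch::Audio;

# Returns an object of whatever class the configuration names.
sub new {
    my ($class, $conf) = @_;
    my $impl = $conf->{audio_front_end};    # assumed configuration key
    die "Bad module name: $impl\n" unless $impl =~ /\A\w+(?:::\w+)*\z/;
    unless ($impl->can('new')) {            # load from disk only if needed
        (my $file = "$impl.pm") =~ s{::}{/}g;
        require $file;
    }
    return $impl->new;
}

package main;

my $audio = Sketch::Audio->new({ audio_front_end => 'Sketch::MAD' });
print ref($audio), ": ", $audio->play('interview.mp3'), "\n";
```

The caller only ever names Sketch::Audio; which class actually
answers the method calls is decided by the configuration data.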

The mechanism for instantiating the browser-based front-end and
request processor is actually a bit more involved.  In this case,
the configuration file can mention several possible front ends that
the program might use.  The command

$DU_G_front_end=Durito::Browser->new();

causes the program to run through the list of front ends in the
configuration file.  It goes with the first one that instantiates
successfully.
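
The "first one that works" selection could look something like this. 
All package names here are hypothetical, and the list of candidates
is passed in directly rather than read from a configuration file.

```perl
use strict;
use warnings;

# Two hypothetical front ends: one that fails to start, one that works.
package Sketch::FrontEnd::Missing;
sub new { die "browser not installed\n" }

package Sketch::FrontEnd::Netscape4;
sub new { my ($class) = @_; return bless { name => 'N4_01_es' }, $class }

package Sketch::Browser;

# Try each candidate class in order; return the first object created.
sub new {
    my ($class, @candidates) = @_;
    for my $impl (@candidates) {
        my $front_end = eval { $impl->new };
        return $front_end if $front_end;
        warn "Skipping $impl: $@";
    }
    die "No front end could be started\n";
}

package main;

my $front_end = Sketch::Browser->new(
    'Sketch::FrontEnd::Missing',
    'Sketch::FrontEnd::Netscape4',
);
print "Using ", ref($front_end), "\n";
```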

(Which request processor is used depends on which front end the
program has started.  This mechanism might well change in upcoming
versions of Durito, however.)

Note that all this hullabaloo about Durito being able to use
different front ends is, for the time being, less important than may
appear, given that only one front end (Netscape 4.x) has actually
been implemented...  Actually, most other areas of the program are
in a similar situation.

For comments on why it's worth it to worry about possible different
front ends, please see section 3.5, "Front ends and Netscape 4.x".


>>>> 3.2 Text processing model



Durito has been built around a text processing model that assumes
(1) that most of the documents to be presented to users are in an
interface-neutral, structured XML format, and
(2) that it will sometimes be necessary to present a given document set
using different front ends.

This model is certainly subject to revision.  Although it rigorously
separates different kinds of processing, it may make interface
implementation and modification a bit cumbersome.  The different
levels of text transformation also slow processing.

In theory, it should work like this:

As text-based information is processed, it moves from
source-specific/situation-neutral to
source-neutral/situation-specific formats.  At level A, each kind of
data may have its own format.  When it reaches level B, all data has
been "smoothed down" to a single format, but no situation-specific
information has been added.  By level C, user-interface-specific
elements have been added, and by level D, all runtime-dependent
elements have been added.  Data may "join the process" at any level:
instead of starting out at level A, some data may already be in a
format appropriate for level B, C or D, and thus may enter the train
at that point.  After level D, all data is sent to the browser.

Here are the formats that are appropriate:

Level A
=======
Data may be in any format for which there exist mechanisms that can
transform it into a format appropriate for level B.

Level B
=======
A special format: "Level B proto-XHTML"

Level C
=======
Another special format: "Level C proto-XHTML"

Level D
=======
Any format that can be directly understood by the front end.
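
As a rough illustration of data "joining the train" at different
levels, here is a small sketch.  The three transformation steps are
placeholders invented for the example (they just wrap the text in a
marker); in Durito the real steps are XSLT stylesheets and regular
expressions.

```perl
use strict;
use warnings;

my @LEVELS = qw(A B C D);

# One placeholder transformation per step, keyed by the level it starts from.
my %STEP = (
    A => sub { "<b>$_[0]</b>" },    # A -> B: smooth down to a single format
    B => sub { "<c>$_[0]</c>" },    # B -> C: add interface-specific elements
    C => sub { "<d>$_[0]</d>" },    # C -> D: add runtime-dependent elements
);

# Run $data through every remaining step, from its entry level up to Level D.
sub process {
    my ($data, $entry_level) = @_;
    my ($start) = grep { $LEVELS[$_] eq $entry_level } 0 .. $#LEVELS;
    die "Unknown level: $entry_level\n" unless defined $start;
    for my $level (@LEVELS[$start .. $#LEVELS - 1]) {
        $data = $STEP{$level}->($data);
    }
    return $data;
}

print process('interview', 'A'), "\n";    # passes through all three steps
print process('ui frame',  'C'), "\n";    # only the last step applies
print process('static page', 'D'), "\n";  # already final, sent unchanged
```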

The current prototype of the Testimonios Zapatistas archive _sort of
partially_ implements this model with several of TZ's interviews.  I
should emphasize that the current program is deficient in the
following ways: (1) It does not make use of RDF to channel documents
to the correct processes, as would ideally be the case.  (2) It is
"hardwired" to work with the TZ archive.  (3) It is also "hardwired"
to use Netscape 4.x.  (4) Some of the text processing that it
carries out does not follow this model.  In addition, the two
varieties of proto-XHTML employed are not at all mature
formats; rather, they were thrown together quickly, in an attempt to
just "make it work".  A lot more refinement will likely be necessary
if Durito is to be used to publish different kinds of archives.

For the time being, the two kinds of proto-XHTML look like this:

Level B proto-XHTML
===================
A well-formed XML document containing elements that would be present
in the _body_ of an XHTML document, all wrapped in
 or  tags, which are in turn
contained within a  tag.  The document may
also contain special "entities", which are always surrounded by
double dashes ("--") and are replaced with other data during this or
subsequent stages.

Level C proto-XHTML
===================
A valid XHTML document that has been adapted for a specific front
end.  Can still contain a few special entities, which will be
replaced before the document is sent to the browser.  According to
this model, there will actually be many different varieties of Level
C proto-XHTML: one for each sort of front end that is to be
supported.
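
The replacement of double-dash entities can be sketched with a single
regular expression.  The entity names used below ("inst_index",
"title") are assumptions chosen for the example, not a list of the
entities Durito actually defines.

```perl
use strict;
use warnings;

# Replace --name-- entities whose value is known at this stage;
# unknown ones are left in place for a later stage to fill in.
sub expand_entities {
    my ($text, $values) = @_;
    $text =~ s{--(\w+)--}{ exists $values->{$1} ? $values->{$1} : "--$1--" }ge;
    return $text;
}

my $doc = '<a href="/NonDoc.html?inst_index=--inst_index--">--title--</a>';
my $out = expand_entities($doc, { inst_index => 3 });
print "$out\n";
```

Leaving unknown entities untouched is what allows the same routine to
be applied at more than one level.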

In the current implementation of the TZ archive, documents enter the
processing train at various points:

Start at Level A
================
Interview transcriptions
Table of contents
(Eventually, generated data such as search results will likely also
start here.)

Start at Level B
================
(Currently, the results from the mock-up Minisearch search engine
enter here.  Not shown on the diagram.)

Start at Level C
================
Elements of the Netscape-specific local user interface (for example,
frames_N4_01_es.xhtml)

Start at Level D
================
Basically, everything contained in the HtDocs directory.

Regarding transformations: transc_html_01.xsl and list_html_01.xsl
move data from Level A to Level B.  N4_01_es.xsl does some of the
work of moving data from Level B to Level C.  All the rest of the
transformations are performed using regular expressions.

The main problem with this model seems to be speed when dealing with
large documents.  The interview transcriptions, in XML format, can
be hundreds of kilobytes long; especially time-consuming is the step
from Level A to Level B.  To get around this, the program uses a
system of "shortcuts", sort of like a static cache: before the
archive is distributed, long documents are pre-processed,
transformed from Level A to B; so the archive actually contains such
texts in both formats.  When the system receives a request for a
transformation of a document from Level A to B, it first looks to
see if there is a pre-processed Level B version of the document
available.  If there is, it just loads that instead of spending time
doing the transformation.  However, it can still access the Level A
version of the document when it has to (such as to produce summaries
for search results).  The process of creating this static cache
could, in future versions, be automated during the process of
building an archive.  (See "Durito in two acts" in the original
design proposal.)  There is also, of course, much room for
improvement in the runtime part of this mechanism.
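
A minimal sketch of the "shortcut" lookup follows.  The file naming
scheme (".levelB") and the function name are assumptions made for the
example, and the slow transformation is represented by a callback
rather than a real XSLT run.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Return the Level B text for $name: use the pre-processed copy if one
# exists, otherwise fall back to running the (slow) transformation.
sub level_b_for {
    my ($name, $cache_dir, $transform) = @_;
    my $cached = File::Spec->catfile($cache_dir, "$name.levelB");
    if (-e $cached) {
        open my $fh, '<', $cached or die "Cannot read $cached: $!\n";
        local $/;                      # slurp the whole file
        return scalar <$fh>;
    }
    return $transform->($name);
}

# Demonstration: one document is pre-processed, the other is not.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'interview01.levelB');
open my $out, '>', $path or die "Cannot write $path: $!\n";
print {$out} 'cached body';
close $out;

my $slow_calls = 0;
my $slow = sub { $slow_calls++; return "transformed $_[0]" };

my $first  = level_b_for('interview01', $dir, $slow);
my $second = level_b_for('interview02', $dir, $slow);
print "$first\n$second\n";
```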


>>>> 3.3 Requests and instances

Durito is designed to have an unlimited number of interfaces open at
any given time.  This is an obvious need for the Internet version of
the program.  In the local version, it is more of a luxury: it
allows a single user to simultaneously open several windows of an
archive.  Each window is referred to as an "instance".  These
windows do not represent distinct instances of the Durito backend;
they are just instances of the user interface.

"Requests" are simply requests for one or more actions by the Durito
backend.  They are sent via query strings in URLs.  All requests are
associated with a specific instance.

"Commands" are the specific actions that the backend should perform.

All instances, requests and commands are managed via a single
object, $DU_G_rp, of the class Durito::RP.  This is the program's
request processor.  Only one of these objects is generated per
execution of the program.

Internally, $DU_G_rp refers to a data structure that keeps track of
all instances, requests, and commands.  See RP_example.txt for a 
snapshot of that structure as seen by Data::Dumper.

It is the responsibility of this area to hand off commands to other
sections of the program by calling their respective APIs.  Currently
this handing-off procedure is a bit sloppy and inconsistent, and needs
reorganizing.  Durito::RP::Processor also contains code related not
to channelling commands but rather to the processing of text documents. 
In the future, text processing will be moved to a different module
and will be placed under the control of RDF (as will other parts of
the program).


>>>> 3.4 URL mechanism

Since Durito's interface is always, for the time being, a Web
browser, it has at its disposal the mechanisms of user interface
communication provided by HTTP.  The current version receives only
GET requests; information about the request is placed in the query
section of the URL.  Here is an example:

http://127.0.0.1:1910/NonDoc.html?inst_index=1&cmd1=html_construct(base_frameset)

Its parts are:

Scheme: "http" -- does not change.
Authority: "127.0.0.1:1910" -- determined by information in the
configuration file.
Path: "NonDoc.html" -- ignored in the current version.
Query: "inst_index=1&cmd1=html_construct(base_frameset)" -- contains
the request.

The parts of the query are: "inst_index=1", which identifies the
instance that is sending the request, and
"cmd1=html_construct(base_frameset)", which is the command that the
request processor will parse and execute.

Requests _must_ contain instance information.  If the interface is
requesting a new instance, the request may contain
"inst_index=new", and the program backend will generate an instance
and assign it a number.  The request "inst_index=temp" is also
valid, and is used for actions that need not be associated with any
actual instance.
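
The three kinds of "inst_index" value could be handled along these
lines.  The data layout and the function name are assumptions for the
sake of illustration; the real structure kept by $DU_G_rp is the one
shown in RP_example.txt.

```perl
use strict;
use warnings;

my %instances;         # instance number => per-window state
my $next_index = 1;

# Map the inst_index field of a request to an instance record.
sub resolve_instance {
    my ($inst_index) = @_;
    if ($inst_index eq 'new') {
        my $n = $next_index++;
        return $instances{$n} = { index => $n };
    }
    # A throwaway record that is never stored.
    return { index => 'temp' } if $inst_index eq 'temp';
    die "Unknown instance: $inst_index\n"
        unless exists $instances{$inst_index};
    return $instances{$inst_index};
}

my $window = resolve_instance('new');
print "New instance number: $window->{index}\n";
```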

One of the runtime-dependent elements added to text documents as
they move from Level C to Level D is instance information.  (See
above, section 3.2 "Text processing model".)

Requests may contain more than one command.
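
Here is a sketch of how such a query might be split into its parts. 
It assumes that additional commands are numbered "cmd2", "cmd3" and
so on, and that multiple arguments are separated by commas; neither
convention is confirmed by the example URL above, which shows only
"cmd1" with a single argument.

```perl
use strict;
use warnings;

# Split a query string into the instance id and an ordered command list.
sub parse_request {
    my ($query) = @_;
    my (%field, @commands);
    for my $pair (split /&/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # undo URL escaping
        $field{$key} = $value;
    }
    die "Request lacks inst_index\n" unless defined $field{inst_index};

    # Assumed convention: commands are named cmd1, cmd2, ... and run in order.
    my @cmd_keys = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] }
                   grep { /\Acmd\d+\z/ } keys %field;
    for my $key (@cmd_keys) {
        my ($name, $args) = $field{$key} =~ /\A(\w+)\((.*)\)\z/s
            or die "Malformed command: $field{$key}\n";
        push @commands, { name => $name, args => [ split /,/, $args ] };
    }
    return { inst_index => $field{inst_index}, commands => \@commands };
}

my $req = parse_request('inst_index=1&cmd1=html_construct(base_frameset)');
print "instance $req->{inst_index}: $req->{commands}[0]{name}",
      "(@{ $req->{commands}[0]{args} })\n";
```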


>>>> 3.5 Front ends and Netscape 4.x

In this text, and in the comments in the code, the terms "front end"
and "user interface" have been used more or less interchangeably. 
It is important to note that the area of the program responsible for
interactions with the user encompasses more than just the external
browser program; it includes (but is not necessarily limited to)
modules that interact with, files sent to, and code executed within
the browser.

In this sense, the current front end should not be thought of as
just Netscape 4.x.  Instead, it is a Spanish-language, frame-based
user interface that employs Netscape 4.x and runs in a local
context.  Its name in the configuration file is: "N4_01_es".  It
includes elements in Perl, XSLT, XHTML, HTML, CSS, PNG and
Javascript.  They are:

External programs
=================
Netscape 4.x

Perl code
=========
Durito::Netscape4
Netscape4_ext.pl

XSLT
====
N4_01_es.xsl

XHTML
=====
frames_N4_01_es.xhtml
search_N4_01_es.xhtml
base_interface_N4_01_es.xhtml
wait_message_N4_01_es.xhtml
dynamic_interface_N4_01_es.xhtml

HTML
====
Splash_N4_01_es.html
Err_N4_01_es.html
Start_N4_01_es.html
Start2_N4_01_es.html

CSS
===
N4_tz_01.css

PNG
===
back.png
fwd.png
speaker.png

Javascript
==========
Javascript is included in several of the above elements.


There are a few other chunks that should, in some sense, be seen as
part of the user interface "layer", but are not part of N4_01_es:
Durito::ANServer and Durito::Browser deal with the browser, but were
designed as central parts of the program that would not necessarily
be replaced in other local implementations.  Durito::Audio and
Durito::MAD are part of the audio front end (see below).  N4_01_es
has been developed to control only the limited audio functions of
the MAD audio front end implementation; it would have to be modified
to be able to take advantage of a better, more complete audio front
end.

As readers can see, the creation of user interfaces is cumbersome in
that one has to keep track of how many dispersed bits and pieces will
interact once they've been assembled.  I'm not sure the involved
text processing model is of much help here.

Another issue is: how to make it easier to translate an interface to
a different natural language.  It would perhaps be ideal to store in
a central location all messages presented to the user as part of the
user interface; this would facilitate the management of such
messages, including their translation.  Would it be too
time-consuming to always insert them at runtime?

Some readers might wonder why it's worthwhile to make it possible to
substitute one user interface for another.  Why not just create a
single, standards-compliant implementation?  There are several
reasons (most of which make most sense when one defines "interface"
as more than just the browser).  For one thing, it is desirable to
have different interface functionality in different contexts.  For
example, in the current interface, designed for a local context,
standard browser buttons and menus are removed from view.  This is
because the user in fact has no need for such functions in a local
context--and does not, in fact, even have to know that he or she is
using a Web browser.  In an Internet implementation of Durito, there
would be no reason to remove such controls; browser configuration
would be the user's business.  Another reason for allowing different
interfaces is that standards are constantly evolving; such
modularity should facilitate updating interfaces.  Finally, one
might wish to provide special interfaces for other devices--for
example, voice synthesizers or handheld devices.


>>>> 3.6 Audio front end: MADPLAY mock-up

The audio front end is more of a mock-up than most parts of the
program.  That is, it was just thrown together for the purpose of
creating a demonstration version of the Testimonios Zapatistas
archive.  It uses Madplay, the simple command-line interface for
MAD, a high-quality, free MPEG decoder.  This requires launching a
new process each time the audio is started, and killing it to stop
the recording; as a result, the program is slow to react to
audio-related user requests.  In addition, many of the usual audio
controls that users expect--pause, fast forward, seek, for
example--are not available.  Future versions of Durito must improve
on this situation.
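
The start-a-process, kill-to-stop pattern can be sketched as below. 
This is a generic fork/exec version, not the Win32 code that Durito
uses; the function names are hypothetical, and a sleeping Perl
process stands in for the player command so that the sketch runs
without MADPLAY installed.

```perl
use strict;
use warnings;

# Start a player command in a child process and return its pid.
sub start_playback {
    my (@command) = @_;
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
        exec { $command[0] } @command;
        exit 127;                     # only reached if exec failed
    }
    return $pid;
}

# Stop playback by terminating the child, then reap it.
sub stop_playback {
    my ($pid) = @_;
    kill 'TERM', $pid;
    waitpid $pid, 0;
    return $? & 127;    # number of the signal that ended the child, if any
}

# Stand-in for the real player command.
my $pid    = start_playback($^X, '-e', 'sleep 60');
my $signal = stop_playback($pid);
print "child $pid ended by signal $signal\n";
```

Because every "play" costs a process launch and every "stop" costs a
kill, the sluggishness mentioned above follows directly from the
approach.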


>>>> 3.7 Analysis engine: MiniSearch mock-up

The search mechanism is really a mock-up, too.  It doesn't index (so
it can't deal with a large number of documents) and is hardwired to
work only with the interviews of Testimonios Zapatistas.
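
To show what "doesn't index" means in practice, here is a minimal
sketch of a linear regular-expression search.  The function name, the
result layout and the two sample texts are invented for the example;
they are not taken from Durito::Minisearch or from the archive.

```perl
use strict;
use warnings;

# Scan every document for the term.  No index is built, so the cost of
# each search grows with the total size of the document set.
sub mini_search {
    my ($term, $docs) = @_;
    my $re = qr/\Q$term\E/i;
    my @hits;
    for my $name (sort keys %$docs) {
        my $count = () = $docs->{$name} =~ /$re/g;
        push @hits, { doc => $name, count => $count } if $count;
    }
    # Most matches first; ties broken by document name.
    return [ sort { $b->{count} <=> $a->{count} || $a->{doc} cmp $b->{doc} } @hits ];
}

# Made-up sample texts.
my %docs = (
    interview01 => 'The village school opened again; the school has two rooms.',
    interview02 => 'We walked to the market before dawn.',
);
my $hits = mini_search('school', \%docs);
print "$_->{doc}: $_->{count}\n" for @$hits;
```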


*****************************************

4. Summary: issues and plans

Much of the program, as it now stands, will likely be replaced.  The
parts that are mock-ups obviously must be.  It is also likely that
Durito::ANServer will be replaced with something more
conventional, such as Apache.  The general framework should at least
be cleaned up, if not replaced.  RDF functionality must be created
and integrated with the text processing model and other processes. 
Internet and GNU/Linux versions have not yet been set up.  For the
Internet, a browser-neutral, standards-compliant interface should be
created.  For the local version, we should create a new interface,
using a more modern and legally distributable browser.  Please see
the mailing list archives and the Sourceforge task manager for
related discussions and proposals.

Issues:
- The executable, PerlApp-ized version does not work if installed
in the root directory of a CD-ROM--it must reside in a folder.
- If the user closes a browser window using its "close" button,
the backend is not informed of this event.
- Current version uses a Flash movie in the TZ introduction.  This
is not free.
- PerlApp, the program I've used to create distributable Durito
executables, is also not free--neither in the beer nor the speech
senses of the term.


*****************************************

5. Change log

4 November, 2001
- Better documentation
- Nicer search form
- Fixed a few bugs and made minor improvements in Javascript code
- Made executable, distributable version
- Added a wee bit of minimal error handling
- Improved Minisearch
- Fixed messages in console window
- Added splash screen for TZ

4 September, 2001
- Audio playback using the "MAD" MPEG Audio Decoder.
- A non-indexing search engine implemented using regular
expressions.
- A fancier user interface, including a sort of "status bar" and
back and forward buttons.
- A "static cache" to allow anticipated executions of certain XSLT
transformations that take a long time.
- Other minor adjustments to improve performance, including a change
in the way interviews are rendered into HTML: the table that was
used to place elements on the page has been divided into a series of
shorter tables.  Also, information on character sets is now sent in
the HTTP header.

20 August, 2001
- A frameset with a stable menu bar
- Clean(ish) separation of various levels of text processing:
collection-specific, interface-specific, and runtime-data-dependent.
- A new package, Durito::RP::Process, for processing commands sent
with url queries
- A mock-up of a table of contents
- Multiple simultaneous windows (instances)
- Removal of the timed handler that constantly checked for open
Netscape windows (seemed to cause memory problems)

11 August, 2001
- Initial version containing much of the current framework.


*****************************************

6. License

Durito is distributed under the terms of the GNU General Public
License.  Copyright (c) 2001, Andrew Green.  Copyright of components
is held by their respective authors.  Material of the Testimonios
Zapatistas archive is copyright (c) the Instituto Nacional de
Antropología e Historia, Mexico.