An overview of Oracc projects

The basic unit in the Oracc system is the project. This document gives a basic introduction to how projects are organized and how to create and work with them.

What is a Project?

From the user's perspective a project comprises one or two components:

a portal is a collection of web pages, perhaps with images and downloads, which describe some aspect of cuneiform culture;
a corpus is a collection of texts which have glossaries and possibly some pages of explanation.

A project minimally contains a portal but may also contain a corpus that is related to it.

From the system perspective, there are actually two kinds of projects: main projects (which we call simply 'projects') and subprojects. Subprojects are a way of dividing up projects for various reasons. For instance, the CAMS subproject Ludlul is found in the cams/ludlul/ directory.

Setting up a project

Projects are the organizational core of the Oracc server.

defining a project on the Oracc server is optional; you can work with most of the tools without doing so;
but having a project makes it easy for you to store control lists of graphemes and lemmata, and view your data online even while you are developing it;
a project provides an easy way to present your work online;
email osc at oracc dot org to arrange this.

Before we begin, it is useful to explain the fundamentals which are available to all projects.

Portal

A project has at least one portal page, which may contain links to the corpus. The portal website can be hosted on the same server as the corpus data, or it can be located elsewhere (for instance if required by a funding body).

Files containing editable portal content live in the 00web/ directory. Files containing static content, such as images and downloads, live in the 00res/ directory.

The link to the portal for a project 'cams' is:

http://oracc.museum.upenn.edu/cams

Help on setting up a portal is given here.

Corpus

Most projects relate in some way to a text corpus. The texts are entered or converted to the ATF format and may have translations. The project management software takes care of turning the ATF sources into the various formats used for web display and other purposes.

Working with transliterations

Transliterations are the core of a corpus.

convert legacy data to ATF
add new texts by typing them in ATF
validate your transliterations using the ATF checker
you can optionally use your own control lists for data content like allowable grapheme values and more

Working with translations

Translations can be integrated into the corpus.

Translate the texts or convert legacy translations to ATF.
Publish the corpus to the web with searching or in print with indexes.

ATF files containing transliterations and translations are kept in a project's 00atf directory on the Oracc server.

Pager

The pager is the name given to the web-interface which enables users to interact with the corpus. The pager understands how to present long lists of results in pages, and also how to assemble metadata, texts and translations into pages displaying individual texts.

In your web browser you can jump directly to the corpus pager by using the keyword 'corpus' after the project name in the URL. Compare http://oracc.museum.upenn.edu/saao [http://oracc.org/saao] and http://oracc.museum.upenn.edu/saao/corpus [http://oracc.org/saao/corpus].
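The two address patterns can be sketched in a few lines of Python. This is only an illustration of the URL scheme described above; the function names are hypothetical and are not part of any Oracc tool.

```python
# Server address as given in this document.
BASE_URL = "http://oracc.museum.upenn.edu"


def portal_url(project: str) -> str:
    """Return the portal address for a project name such as 'saao'."""
    return f"{BASE_URL}/{project.strip('/')}"


def pager_url(project: str) -> str:
    """Return the corpus pager address: the portal address plus 'corpus'."""
    return f"{portal_url(project)}/corpus"


print(portal_url("saao"))
print(pager_url("saao"))
```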

Catalogue

While it may not be obvious, the most fundamental part of any corpus is the catalogue. It provides the text metadata (at the very least the CDLI ID and a human-readable designation) that forms the organizational basis for all other components of the project.

The easiest way to provide a catalogue for a corpus is to derive it dynamically from the CDLI catalogue. However, some projects have special needs and in those cases it is possible to tailor the catalogue processing software to the required metadata fields and values.

Working with P-numbers

P-numbers are unique identifiers required by the tools. To get P-numbers for your tablets:

send a brief catalog of texts to CDLI staff
initially such a catalog might only contain identifying information such as museum and publication numbers and a few additional fields giving the author and date of the primary publication and the owner of the objects
information on period, provenience and genre is desirable--and is used by the web-based browse and search tools--but may be added later
further fields are useful but not absolutely required.

If a project has its own catalogue, that is kept in the 00cat/ directory.

Glossaries

The ATF format supports lemmatization, which is the process of adding references to dictionary headwords into the texts. If a corpus is lemmatized, it can be used to generate glossaries directly from the texts with no glossary-editing at all. Normally, however, the glossary and text corpus are used together: the glossary is maintained and may be edited or augmented with bibliography, and the corpus is synchronized with the glossary so that all of the instances of terms are instantly reachable from the glossary articles.

Adding linguistic annotations

Linguistic annotation makes a corpus more useful.

you can add sentence boundaries to the transliterations by simply typing +. in the appropriate places
lemmatization, identifying the word to which each form belongs, is particularly useful
the Oracc tools provide a straightforward procedure for making lists of forms and their lemmata

Glossaries are generated from the ATF files when a project is rebuilt. They live in the 00lib/ directory.

Text Lists

Lists of texts can be handled in either of two ways: as list files or as URLs.

List Files

List files are simply files containing P, Q or X IDs. They must be placed in the directory PROJECT/00lib/lists/, e.g., in cams/00lib/lists/. The rebuild process installs the lists in the proper place. You can then refer to your list by name.

After creating a list file in the CAMS project with the name 00lib/lists/ritual-drawings and the content:

P363719

you can then refer to http://oracc.museum.upenn.edu/cams/ritual-drawings.
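As a rough sketch, a list file like the one above could be written by a short Python script. Two points here are assumptions rather than documented behaviour: that an ID is the letter P, Q or X followed by six digits (as in P363719), and that a list file holds one ID per line. The function name is hypothetical, and the demonstration writes to a temporary directory rather than a real project.

```python
import re
import tempfile
from pathlib import Path

# Assumption: an ID is P, Q or X followed by six digits, as in P363719.
ID_PATTERN = re.compile(r"[PQX]\d{6}")


def write_list_file(project_dir: Path, name: str, ids: list[str]) -> Path:
    """Write ids to PROJECT/00lib/lists/NAME and return the file's path."""
    bad = [i for i in ids if not ID_PATTERN.fullmatch(i)]
    if bad:
        raise ValueError(f"not P, Q or X IDs: {bad}")
    lists_dir = project_dir / "00lib" / "lists"
    lists_dir.mkdir(parents=True, exist_ok=True)
    path = lists_dir / name
    # Assumption: one ID per line.
    path.write_text("\n".join(ids) + "\n", encoding="utf-8")
    return path


# Demonstration in a throwaway directory standing in for the project home.
with tempfile.TemporaryDirectory() as tmp:
    path = write_list_file(Path(tmp) / "cams", "ritual-drawings", ["P363719"])
    print(path.relative_to(tmp))
    print(path.read_text(encoding="utf-8"), end="")
```

The list only becomes available under its name after the project has been rebuilt, as described above.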

URLs

For small numbers of texts, it is convenient to give the P, Q or X IDs in a comma-separated list after the project name:

http://oracc.museum.upenn.edu/saao/P334278,P334279
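A URL of this kind is simple to assemble programmatically. The following Python sketch joins the IDs with commas after the project name; the function name is hypothetical and the code does not check that the IDs exist.

```python
# Server address as given in this document.
BASE_URL = "http://oracc.museum.upenn.edu"


def text_list_url(project: str, ids: list[str]) -> str:
    """Return a URL listing the given P, Q or X IDs after the project name."""
    if not ids:
        raise ValueError("at least one ID is needed")
    return f"{BASE_URL}/{project}/{','.join(ids)}"


print(text_list_url("saao", ["P334278", "P334279"]))
```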

Structure

Users

The project organization is intended for use with multi-user systems. At the operating system level, each project is a user with a password and a home directory.

Subprojects

Projects can own subprojects, which also means that regular users on a system can have their own personal projects.

Folders

The files used by a project live in several different folders (aka directories). The most important of these are:

00atf/: Contains ATF files, conventionally with a .atf or .txt extension.
00cat/: Optionally contains an XML export of the project catalogue (if not using the CDLI catalogue).
00lib/: Contains project configuration files and the glossary files.
00res/: Optionally contains static downloads for the project's portal pages.
00web/: Contains web pages and web configuration files which are used when the project is rebuilt.
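To see at a glance which of these folders a project directory currently has, something like the following Python sketch could be used. The folder names and descriptions come from the list above; the function name and the project name 'myproject' are hypothetical, and the demonstration runs against a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Folder names and short descriptions, taken from the list above.
PROJECT_FOLDERS = {
    "00atf": "ATF files (.atf or .txt)",
    "00cat": "optional catalogue export",
    "00lib": "configuration and glossary files",
    "00res": "optional static downloads",
    "00web": "web pages and web configuration",
}


def folder_report(project_dir: Path) -> dict[str, bool]:
    """Map each standard folder name to whether it exists in project_dir."""
    return {name: (project_dir / name).is_dir() for name in PROJECT_FOLDERS}


# Demonstration on a throwaway directory with only two folders created.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "myproject"  # hypothetical project name
    (project / "00atf").mkdir(parents=True)
    (project / "00lib").mkdir()
    for name, present in folder_report(project).items():
        status = "present" if present else "absent"
        print(f"{name}/  {status}  ({PROJECT_FOLDERS[name]})")
```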

Management

Interface

Project management tasks are carried out by logging on to the Oracc server with a terminal programme and typing some simple Unix commands. Images and files can also be uploaded by drag-and-drop. For more detailed information, see the page on Project Management with Unix.

The `oracc` command

Once logged in as the project-user on the server, most tasks are accomplished via the program oracc, which is fully documented on another page.

Procedures

Project files are stored on the Oracc server, currently http://oracc.museum.upenn.edu. A stable version of the project is publicly viewable on one or more web servers, currently also http://oracc.museum.upenn.edu.

You can build your portal and corpus independently from one another: rebuilding one does not entail rebuilding the other.

For fuller instructions, see the pages on Project Management with Unix.

Portal

Write your portal pages in ESP and upload them to 00web/. Place any static content for your portal in 00res/.

Run the oracc command to update the portal.

Catalogue

If you are using the CDLI catalogue then no action is required. If you are using a custom or local catalogue, first make sure the project is correctly configured, then place the catalogue updates in the 00cat/ folder under the file name(s) the project has been configured to use.

There is a separate page about setting up your own project catalogue.

Corpus

Transliterations should be placed in the 00atf/ folder. There can be one big file, one file per text, or something in between; the rebuild process uses all the relevant files in 00atf/.

When new texts are added, simply run the oracc command to update the website, indexes, etc.

Glossaries

The recommended workflow for glossary building is:

begin with text data which is ATF-clean.
lemmatize the texts; ensure they are ATF-clean with lem-checking, then add them to the 00atf directory.
run oracc harvest.
review 01bld/new/*.new and fix 00atf/*.atf and/or 00lib/*.glo as required.
run oracc merge [LANGUAGE] (this automatically redoes the harvest).
if something goes wrong, you can retrieve the previous *.glo file from the 'backups' directory; note that multiple oracc merge [LANGUAGE] commands on the same day overwrite the same backup file.
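The command-line part of this workflow can be summarized in a short Python sketch. It only assembles the two commands named above and lists the files to review; it does not run anything on the server. The function names are hypothetical, and 'sux' is used purely as an example value for [LANGUAGE].

```python
import shlex
from pathlib import Path


def glossary_commands(language: str) -> list[list[str]]:
    """Return the harvest and merge commands from the workflow, in order."""
    return [["oracc", "harvest"], ["oracc", "merge", language]]


def files_to_review(project_dir: Path) -> list[Path]:
    """Return the 01bld/new/*.new files to check between harvest and merge."""
    return sorted(project_dir.glob("01bld/new/*.new"))


# 'sux' is an example language code only.
harvest, merge = glossary_commands("sux")
print(shlex.join(harvest))
print("review:", [p.name for p in files_to_review(Path("."))])
print(shlex.join(merge))
```

The review step stays manual: the merge should only be run once the .atf and .glo files have been corrected in the light of the .new files.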

18 Dec 2019 osc at oracc dot org

Steve Tinney & Eleanor Robson

Steve Tinney & Eleanor Robson, 'An overview of Oracc projects', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/help/managingprojects/projects/]
