The Making of the CorDis Corpus: Compilation and Mark-up
The CorDis Corpus is a large multimode, multigenre collection of political and media discourse on the 2003 Iraqi conflict. It was generated from different subcorpora previously assembled by various research groups for diverse discourse-analytical purposes. A more detailed description of its composition can be found in the introduction. A significant portion of our work was devoted to turning the subcorpora into a unified, homogeneously encoded corpus that could be interrogated using Xaira. Initially, the corpus was only lightly encoded by each research group on the basis of specific research objectives and hypotheses. The heterogeneity of the data, the specificity of the genres and the variety of methods adopted involved the use of a wide range of coding strategies to make textual and meta-textual information retrievable by means of available concordance software. It was clear from the outset that marking up the corpus as a whole would entail various levels of pre-encoded and pre-existing interpretation. The main purpose of this paper is to show the process of standardization and integration whereby a loose collection of texts has become a stable architecture. The TEI Guidelines proved a valid instrument, providing for a hierarchical organization of metadata which makes mark-up part and parcel of the corpus. We will underline that it is precisely the mark-up which gives the corpus a sound structure, favouring the replicability and enhancing the reliability of research. In discussing some examples we will deal with issues such as conformity and validity, and we will examine the constraints imposed on data handling by the methodological framework adopted. In particular, we will argue that the crucial role of annotation leads to a reconsideration of the definition of corpus itself, in which special emphasis is placed on mark-up being the backbone of the corpus rather than a superimposed accessory.
Finally, the fact that mark-up involves a substantial amount of human intervention on machine-processed data has some crucial implications for corpus-assisted discourse studies (CADS), since it permits the combination of qualitative and quantitative research approaches. There is a tendency to distinguish between ‘mark-up’ and ‘annotation’ (McEnery, Xiao, Tono 2006: 29), adopting the first term to refer to contextual information (i.e. editorial and descriptive metadata) and the second to refer to ‘interpretative linguistic information’. We will here use the two terms interchangeably, since both notions share the same salient qualities for the purposes of our description: both constitute added value and both carry interpretative information.