Maximising the Power of Semantic Textual Data: CASTEMO Data Collection and the InkVisitor Application

Zbíral, David; Mertel, Adam; Shaw, Robert L. J.

In this paper, we present Computer-Assisted Semantic Text Modelling (CASTEMO), a novel approach to the transformation of textual resources into deeply structured data stored in JSON-based document databases. We also present the InkVisitor application, which assists this data collection workflow and helps validate the data. Both the workflow and the application were developed within the Dissident Networks Project (DISSINET, https://dissinet.cz).

CASTEMO builds on widespread ideas, such as semantic data (e.g. the Semantic Web) and the syntactic structure of natural language sentences (in our case, subject-verb-object1-object2 quadruples), and we acknowledge convergent developments (mainly Roberto Franzosi’s Quantitative Narrative Analysis). Nevertheless, we follow our own path towards deeply structured and deeply semantic data drawn from texts. Such data allow us to preserve, and thus quantitatively analyse, for example:

the order and syntactic embeddedness of information;

the textual embeddedness of information (i.e. who is speaking, to whom, and in what context);

the original language, expression, and discourse;

the distinction between epistemic levels.
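As a rough illustration of how one such quadruple could be stored together with the features listed above, the following minimal Python sketch serialises two statements into a JSON document. It is not InkVisitor's actual schema: every field name, the epistemic-level label, and the sample statements are assumptions invented for this example.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Statement:
    """One subject-verb-object1-object2 quadruple plus its textual context.

    All field names are hypothetical and chosen for illustration only.
    """
    order: int                    # position of the statement in the text
    subject: Optional[str]
    verb: str
    object1: Optional[str]
    object2: Optional[str]
    speaker: Optional[str]        # who is speaking
    addressee: Optional[str]      # to whom
    original_text: str            # wording in the source language
    epistemic_level: str          # hypothetical label, not a CASTEMO term
    parent_order: Optional[int] = None  # statement this one is embedded in


def to_document(statements):
    """Serialise statements into one JSON document, sorted by text order."""
    ordered = sorted(statements, key=lambda s: s.order)
    payload = {"statements": [asdict(s) for s in ordered]}
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Invented sample: a witness reports that Peter gave a book to Maria.
reported = Statement(
    order=2, subject="Peter", verb="gave", object1="a book", object2="Maria",
    speaker="witness", addressee="questioner",
    original_text="(source wording goes here)",
    epistemic_level="as_stated_in_text", parent_order=1,
)
framing = Statement(
    order=1, subject="witness", verb="said", object1=None, object2=None,
    speaker=None, addressee="questioner",
    original_text="(source wording goes here)",
    epistemic_level="as_stated_in_text",
)

document = to_document([reported, framing])
print(document)
```

Keeping the order, the embedding link, the speaker, and the original wording on each record is what would let later analysis count statements while still filtering by who said what, to whom, and at which epistemic level.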

CASTEMO thus offers a time-intensive but extremely powerful alternative to (1) text mining, which often fails to answer fine-grained questions, and (2) Computer-Assisted Qualitative Data Analysis Software (CAQDAS), where opportunities for quantification are incidental and severely limited by the original hypothesis. CASTEMO should be of interest to projects that aim to analyse information quantitatively, strictly in the context of its production (“source criticism 2.0”), and to examine the discourse of texts.

In this paper, we present the foundations of this data collection workflow, its main strengths, and caveats for potential users. We also provide a first public presentation of InkVisitor, an open-source browser-based application implementing the CASTEMO workflow.

Keywords: digital humanities; data collection; textual mining; text processing

Lecture (Conference)
Computing the Past: Computational approaches to the dynamics of cultures and societies, 06.-8.10.2022, Pilsen, Czechia

Permalink: https://www.hzdr.de/publications/Publ-35767