Linked Open Data for Written Artefacts / ENCODE Advanced Training
Hamburg, 26–28 May 2021 (Zoom)
The training will introduce participants to basic Linked Open Data technologies and show techniques to produce, store, visualize, query, reuse, and share data as Linked Open Data in RDF.
Based mainly on the experience of the Beta Maṣāḥǝft project, the training will use examples from epigraphy, codicology, and papyrology, and will welcome further diverse datasets to be connected.
Trainer: Pietro Liuzzo
These video introductions cover similar content in a limited time and are very useful for getting into the themes. The theory is introduced in a clear and simple way, and the related activities and exercises can be used as how-tos.
VIAF http://viaf.org/viaf/data/
Pleiades https://pleiades.stoa.org/downloads
EDH Data (Latin inscriptions) https://edh-www.adw.uni-heidelberg.de/data
Beta Maṣāḥǝft endpoint (Ethiopian manuscripts) https://betamasaheft.eu/sparql
Papyri.info (documentary papyri) see each entry, links at the bottom of the page
Nomisma (coins and numismatics) http://nomisma.org/
IDEA Community (Zenodo) https://zenodo.org/communities/eagle-idea?page=1&size=20
Arachne (archaeological data) https://arachne.uni-koeln.de/drupal/
Data hub https://old.datahub.io/dataset?q=CIDOC-CRM (including British Museum)
Wikidata https://www.wikidata.org/wiki/Wikidata:Main_Page
DBpedia https://wiki.dbpedia.org/
Europeana https://pro.europeana.eu/page/linked-open-data
PeriodO – Periods, Organized
GODOT https://godot.date/about
prefix.cc: namespace lookup for RDF developers
Linked Open Vocabularies (LOV)
Ontology Management Environment
9–9.15 Introduction to the training
9.15–10 Wonder.me breakout session
10–11 Presentations
Creating an ontology is not easy, and not because of the data or the code: knowledge modelling itself is not easy. The data models used to represent that knowledge, on the other hand, need to be easy to work with, and they are. In this exercise we will try this out.
Instructions
Download and install Protégé (https://protege.stanford.edu/products.php#desktop-protege) or log in to use WebProtégé online.
Step 1. You will design your ontology from scratch. We do not want to build the one which will be used, but the one which represents our understanding. Examples in the wild include pizzas, movies, cars, etc. We will model "Books". Feeling lucky? Try "Manuscripts". Get started however you want. Imagine the user scenario you want: a library, a book shop, your own library, a reading list… Pick the side of the representation you want and model it. Which classes do you need? Which properties connect them? Which vocabularies do you point to?
Step 2. Check which resources you already know (FRBR(oo), Wikidata, DBpedia, VIAF, etc.) and look for existing ontologies. What can you align, and how? Perhaps it would have been better to start from these instead? A minimal sketch of what such a model can look like in RDF terms follows below.
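To make the outcome of the two steps concrete, here is a minimal sketch of what a tiny "Books" model can boil down to, written as a SPARQL Update so that it can later be pasted into a triplestore. This is not the ontology we will use: the ex: namespace is a placeholder, and the single alignment to Wikidata (Q571, "book") only illustrates the kind of mapping Step 2 asks for.

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX ex:   <http://example.org/books#>

# Two classes, two properties, one alignment: a toy "Books" model.
INSERT DATA {
  ex:Book   a owl:Class ;
            rdfs:label "Book"@en ;
            owl:equivalentClass wd:Q571 .   # placeholder alignment with Wikidata "book"
  ex:Person a owl:Class ;
            rdfs:label "Person"@en .
  ex:author a owl:ObjectProperty ;
            rdfs:domain ex:Book ;
            rdfs:range  ex:Person .
  ex:title  a owl:DatatypeProperty ;
            rdfs:domain ex:Book ;
            rdfs:range  xsd:string .
}

Protégé produces the equivalent OWL for you through the interface; the point of the sketch is only that classes, properties, domains, and ranges are themselves nothing but triples.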
Questions come before answers. It is far more important to be able to build a query than to read the results: in SPARQL, the format of the results is defined together with the query. In this exercise you will practice the formal language used to build your queries. You will need to look at the data and think about the result format.
Instructions
First, try to reproduce in the Wikidata Query Service a query from the examples.
Now produce your own SPARQL query against one of the endpoints listed in the presentation. You may stay on Wikidata if you wish and have identified data relevant to you; you may simply modify one of the examples.
Share your query here. An example (the one below) is provided.
Now try to get a map of the current location of "written artefacts" from Wikidata.
You can start from this example query
#defaultView:Map
# Map of Codices in Wikidata at their location (Q213924)
SELECT *
WHERE {
?manuscript wdt:P31 wd:Q213924 ;
rdfs:label ?itemLabel .
OPTIONAL{?manuscript wdt:P276 ?location .
?location wdt:P625 ?coordinates .}
}
Try to navigate the superclasses of Q213924 and, by browsing an example item (e.g. Q204221), investigate the location property. What do you think?
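A possible starting point for the superclass navigation, assuming you run it in the Wikidata Query Service (P279 is "subclass of"; the label service line is specific to that service):

# Superclasses of codex (Q213924), following "subclass of" (P279) upwards
SELECT ?class ?classLabel
WHERE {
  wd:Q213924 wdt:P279* ?class .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}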
We want to use the linked data which is available. Finding and getting the data is not the simplest of things. To play with your data, start by setting up your own triplestore.
Instructions
Set up a Fuseki triplestore locally (as presented in the slides). If you do not like Fuseki, you can consider Blazegraph: I have no configuration hints for it, since I have never used it, but it is the tool of choice of the edition of Bufalini's book that is the object of this article.
You can enrich existing data directly by editing Wikidata. Beta Maṣāḥǝft would love it if you added a few identifiers from the project to Wikidata entities! Or make annotations with Hypothes.is which use tags pointing to Beta Maṣāḥǝft resources. Make an account and add statements.
You can also use software which takes care of the management for you.
Annotate with Hypothes.is, based on Web Annotation standard, within the workshop's group.
Use the URI of the annotations in your own triples.
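For example, an annotation URI can become the subject of your own statements. In the sketch below, both URIs are placeholders (copy the real annotation URI from Hypothes.is and the real target URI from the page you annotated); the oa: terms come from the Web Annotation Vocabulary.

PREFIX oa: <http://www.w3.org/ns/oa#>

# Placeholder URIs: replace with your own annotation and its target.
INSERT DATA {
  <https://hypothes.is/a/ANNOTATION-ID> a oa:Annotation ;
      oa:hasTarget <https://betamasaheft.eu/RESOURCE-ID> .
}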
You can also try to produce some Linked Open Data with a tool and reuse it. For example, log in to Recogito, upload a document or an image, and annotate it. Click the download button and download your annotations as RDF.
You can also try this interoperability test, which aims at getting linked resources, one connection each. I am suggesting some datasets here; you can of course check out your own.
Import all the pieces you have collected into your Fuseki instance. Query!
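As a sketch (dataset name, port, and file URLs are placeholders for whatever you configured and collected), loading can be done with standard SPARQL 1.1 Update operations sent to the update endpoint, typically http://localhost:3030/<dataset>/update on a default Fuseki install:

# Load each collected file into its own named graph (placeholders throughout).
LOAD <https://example.org/my-recogito-annotations.ttl> INTO GRAPH <urn:graph:recogito> ;
LOAD <https://example.org/my-binding-triples.ttl> INTO GRAPH <urn:graph:bindings>

A first sanity-check query against the query endpoint (typically http://localhost:3030/<dataset>/sparql) then counts what arrived in each graph:

SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g

If your files are only on your disk, the Fuseki web interface also lets you upload them into a dataset directly.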
Interacting with controlled vocabularies is as intuitive as it is easy to do wrong (but no worries, no one is going to die!). Try and believe.
Instructions
Look at the Language of Bindings Thesaurus (LoB), https://www.ligatus.org.uk/lob/
Pick a manuscript from Beta Maṣāḥǝft, look at the images, and put together with any editor some triples which build up a description of the binding using the Ligatus vocabulary. Note the RDF representation of the manuscripts, containing their URIs, which is provided at the bottom of each page. Do you not know how to model these? Perhaps this article can help.
Do you not like manuscripts? Why not download some RDF for a papyrus from Papyri.info instead? Instead of Ligatus, you may link some of the information present in the metadata to the EAGLE Vocabularies or the Getty Art & Architecture Thesaurus.
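For the manuscript option, a minimal and deliberately flat sketch of such triples might look like the following, again as a SPARQL Update you can later paste into Fuseki. All URIs are placeholders: take the manuscript URI from the RDF at the bottom of its Beta Maṣāḥǝft page and the concept URI from the LoB page of the term you pick. dcterms:subject is used here only as a shortcut; a fuller model (as in the article mentioned above) would describe the binding as an entity of its own.

PREFIX dcterms: <http://purl.org/dc/terms/>

# Placeholder URIs throughout: replace with the real manuscript and LoB concept URIs.
INSERT DATA {
  <https://betamasaheft.eu/MANUSCRIPT-ID>
      dcterms:description "Draft binding description made during the ENCODE training."@en ;
      dcterms:subject <https://www.ligatus.org.uk/lob/concept/CONCEPT-ID> .
}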
On Day 1 we modelled a generic domain of Books with WebProtégé. Now that you know where to look, try to imagine how to model your own data: not all of it, just a piece which is of more immediate interest to you. Use a flexible and accessible tool: pen and paper.
Instructions
It would not be Linked Open Data if we did not use it to reach out. We will try to put together what we have done and see whether something can come out of it. Store your outputs in your favourite repository and share in this document whatever you can and would like to share.
Below you can find a chronologically ordered list of publications which I have read and think can be useful for you. Because some are lengthy and one cannot read them all, I have added to each a short introductory paragraph with my comments in relation to the content of the training. These, I hope, will help you choose what is most relevant for you.
This is really what it says: the foundations of Semantic Web technologies. You will also find exercises and further readings for each chapter. It is worth reading as an introduction, although any one of us is likely to deal with only one or another of the chapters. But that is part of what the Semantic Web is about: extreme modularity. Be ready for more maths and logic than you are used to.
A provocative and genuinely in-progress set of notes, starting very critically against the Semantic Web and ending with a positive take on SPARQL endpoints. Aaron Swartz, one of the heroes of the Web as we know it, also gives a synthetic and colloquial introduction to the overall architecture of the Web. This short booklet is a quite useful overview for understanding how the Web actually works, and for starting to worry about the enormous overhead of any digital effort in terms of sustainability. Read this together with a detailed report on data-centre energy consumption, as well as some post-colonial minimal computing efforts, and you will soon start wondering whether it would not be better simply to turn everything off and write on paper only.
The SAWS ontology is one of the most interesting applications of OWL formalisms for research purposes. It is a bit hard to find the code and its documentation, but this does not make it a lesser achievement. I find this page to be a very useful complement to the presentation and to the contents of some of the videos above. This is also, as far as I know, the origin of the very productive practice of using the <relation> element to include additional statements as triples in a TEI description.
The modelling of texts as linguistic objects in CRMtex is a good example of a fine effort which is as useful as it is unused. Engineers may think of that as a negative result; I think it is a matter of time. Before this was in place it could not be used; now that it is, we know where to look in order to model some core features of written artefacts (not only in epigraphy, I would claim).
The entire COMSt manual is a must-read for opening up the perspective beyond the more abundant or more European manuscript traditions. This contribution by Gippert makes a point about the "digital approaches".
One interesting (and concerning) thing about this quite useful introduction is that it is retired: it no longer works in some respects. I think this is an experience we had better get used to, that of writing about things which decay in a timespan much shorter than anything we imagined. The article introduces RDF using some of the resources we have also described. The fact that some no longer work may prompt you to try the same on the resources you are interested in.
Start from this to get a solid introduction to why Linked Data is important. One statement is all it takes to make a difference! CIDOC is introduced here by its creators, and a positive argument is made against a quite widespread absurdity, namely that of opposing TEI encoding and LOD practices as alternatives. These are done differently and for different purposes; they go hand in hand and are not alternatives, either as views of the world or as digital practices.
One of the easiest to encode, yet most concerning, parts of digital work for scholars in the humanities is the representation of various forms of doubt and uncertainty. How often do we feel that one model or another wants us to state a fact which we are actually not ready to state? Even stating it with doubt may not be what we want to do. The volume collects several contributions which address this fundamental issue from different angles. I would like to suggest that inference is another area where this becomes exponentially problematic.
SPARQL and Semantic Web technologies can be used to map "traditional" SQL databases on the fly and aggregate them, without having to do much more than write a properly standardized mapping. Since we hope that SQL databases which have been curated for decades will remain with us, this is indeed a valuable way to bring them into the game of the present and future, which is that of the Semantic Web. Ontop does this, and it works very well.
It is very easy to find prophets of LOD, so it is important to hear the other side of the story as well. This short reading makes, in my opinion, a very simple and clear point: purpose is at the heart of what we do, so let us do LOD and RDF when we actually use it, not just because it is good or because Sir Tim Berners-Lee said it is good for us to do so.
This theoretical introduction is an accessible reading to get started. Eventually, I hope, you will not need it, but it may be useful as a refresher in a couple of years if you do not use any of this for the moment and then, for example, get funding and need to get started again.
The article discusses the impact of digital tools on digital editions and their use by philologists. Examples are taken from the Euripides Scholia and the Homer Multitext. The latter project is one that challenges the notion of critical edition altogether, opposing to it a new paradigm, that of a multitext edition. The author (whom we have also heard directly in the evening presentation) discusses some defining aspects and challenges the rather positivistic rhetoric of the digital champions.
Here the authors discuss the use of the CIDOC CRM (in theory) to record information on changes to the binding through time. The examples are linked to the LoB vocabulary, providing a good example of how to connect models and specific vocabularies. The use of graph diagrams instead of a readable CURIE-based notation makes access to the information a bit more challenging. The authors make an important point about prioritizing observation and recording over deduction and the unpacking of that information.
This article offers an example of how to work with features of a digital edition whose modelling benefits from linked data structures and builds further on document-centred models. It also gives an example of the use of w3id.org for the stable identification and long-term support of URIs, which must, however, be combined with server-side support for their resolution. Integration of the data model with the SPAR ontologies is also presented, with examples and further references. The citation structures using Web Annotation models may, for example, be integrated with DTS API presentations.
If you work with Digital Classicists you will soon meet the work of Hugh Cayless. We have seen, for example, the LAWD ontology, which he put together. In this article he points out some uses of RDF and Linked Data for classicists and also gives some historical details on the development of resources like Pleiades, Papyri.info, and Open Context, with a view to sustainability and complexity. I strongly recommend this reading, also to understand some of the risks involved in this kind of work (over-encoding, to name one).
In my book, especially in Chapters 2, 4, and 7, you will find a number of SPARQL queries with explanations and several other examples of XQuery and XSLT in action. This is all applied to Ethiopian manuscripts but is entirely applicable to any other context.
I have learned more from this book (the third edition) than from any other scattered resource. Unfortunately the examples are all out of our scope, but they are still clear and understandable by anyone. If you plan to work with RDF and Semantic Web technologies, I would recommend this.
This is a special issue on the Semantic Web for cultural heritage. Leaving aside the fact that the hard sciences need to cluster all of us into something far too big to be of any relevance, the issue contains various interesting contributions. I would point out in particular the one on Gravsearch for implementers, and the contributions about CRMtex, the modelling of execution techniques, and the narrative ontology.
This volume too, like the previous one, is a collection of contributions on several aspects of Linked Data. Why the Mediterranean was picked for the title, I do not know: the resources and ontologies described reach far beyond the "lake". Look at the contribution about PeriodO, which has a global scope. Sebastian Heath's contribution will also walk you through using SPARQL and adding visualizations directly in a 'live' IPython notebook.
The latest version of the CIDOC CRM. The most important part is the one which lists changes from previous versions; some rather important ones are in place now. No standard is ever closed or finished, and no resource is, but this is a reference work which should always be on your desk(top) when doing linked data. For those who think this is more complex than the TEI, print the TEI Guidelines, which should also be kept on the desktop, and let me know which reads better. Both these resources are basic reading for anyone doing digital work.
The Epigraphy.info Ontology Working Group produced this document as a collaborative effort over the last few years. No one uses it yet. It attempts to go through the various types of information needed for epigraphy in the same way this was done in Nomisma for coins. It additionally uses a list of pre-existing vocabularies, specific for example to amphoric epigraphy, and an implementation of CRMtex which did not exist before.
At the end of the training, participants will have received initial training to be able to
− Understand RDF structured data, query RDF with SPARQL, and visualize data returned from a SPARQL Query
− Know the main standards (IIIF, CIDOC CRM, DTS) and guidelines used in the field of interest and know where to retrieve further documentation
− Know basic tools available to produce RDF and OWL data (Protégé, Atom)
− Transform data to RDF (Xtriples) and store it (Fuseki)
− Use with ease some simple reference materials (Wikibase) and annotation tools (Hypothes.is, Zenodo)
− Apply simple models to their data to produce RDF and use it to answer simple questions
− Analyse inquisitively the results of the process
− Understand the scientific processes and workflows involved with the modeling and querying of RDF data
− Understand and apply to their own needs the best practices of Linked Open Data.
Linked Open Data for Written Artefacts Program © 2021 by Pietro Liuzzo is licensed under Attribution-NonCommercial 4.0 International
Part of the Project Bridging the <gap> in Ancient Writing Cultures: ENhance COmpetences in the Digital Era