1. Introduction
PublishMyData (PMD) is a [LINKED-DATA] software platform that helps government organisations to make their data more accessible and re-usable.
The platform combines a variety of existing standards. Not every part of every standard is adopted. In some cases it’s been necessary to extend or revise standards.
This document describes an application profile for PMD to explain how these standards have been woven together.
You can use this document to understand how to create and work with data that’s compatible with the platform and its features. We describe each of the main types of resources and explain how they ought to be described.
You can find a suite of tests in the PMD RDF Validations repository. These queries are designed to find examples of violations. We link to the relevant validations throughout this profile.
2. Namespaces
The following namespaces are used in this document:
    prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix foaf: <http://xmlns.com/foaf/0.1/>
    prefix dcterms: <http://purl.org/dc/terms/>
    prefix dcat: <http://www.w3.org/ns/dcat#>
    prefix qb: <http://purl.org/linked-data/cube#>
    prefix sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
    prefix sdmxa: <http://purl.org/linked-data/sdmx/2009/attribute#>
    prefix sdmxcode: <http://purl.org/linked-data/sdmx/2009/code#>
    prefix scovo: <http://purl.org/NET/scovo#>
    prefix interval: <http://reference.data.gov.uk/def/intervals/>
    prefix xsd: <http://www.w3.org/2001/XMLSchema#>
    prefix ui: <http://www.w3.org/ns/ui#>
    prefix pmdcat: <http://publishmydata.com/pmdcat#>
    prefix geo: <http://www.opengis.net/ont/geosparql#>
    prefix geof: <http://www.opengis.net/def/function/geosparql/>
    prefix pmdgeo: <http://publishmydata.com/def/pmdgeo/>
    prefix pmdkos: <http://publishmydata.com/def/pmdkos/>
    prefix skos: <http://www.w3.org/2004/02/skos/core#>
    prefix eg: <http://example.com/>
3. Catalog
PMD allows publishers to organise datasets into catalogues.
The PMDCAT vocabulary builds on the Data Catalog Vocabulary Version 2 [vocab-dcat-2], creating subclasses of some of the key entities to reflect that some additional assumptions are made.
DCAT2 does not allow you to treat the catalog metadata differently from the dataset’s contents. We consider this distinction important because it cleanly separates the metadata that may come from an upstream author from the additional metadata that a curator wishes to add.
These catalog extensions are defined in the pmdcat ontology.
The classes pmdcat:Catalog and pmdcat:Dataset are sub-classes of their DCAT equivalents.
We introduce a pmdcat:DatasetContents class to identify the resource which represents the contents. This in turn has a variety of subclasses, such as pmdcat:DataCube.
We also use pmdcat:metadataGraph and pmdcat:graph to specify where the dataset metadata and data contents are stored.
These resources fit together as in the example below. An explanation of each piece follows.
    eg:catalog-graph {
      eg:datasets-catalog a pmdcat:Catalog, dcat:Catalog .
    }

    eg:census-metadata-graph {
      eg:datasets-catalog dcat:record eg:census-catalog-record .

      eg:census-catalog-record a dcat:CatalogRecord ;
        foaf:primaryTopic eg:census-dataset ;
        pmdcat:metadataGraph eg:census-metadata-graph .

      eg:census-dataset a pmdcat:Dataset ;
        pmdcat:datasetContents eg:census-cube ;
        pmdcat:graph eg:census-data-graph .

      eg:census-cube a pmdcat:DataCube .
    }

    eg:census-data-graph {
      eg:census-cube a qb:DataSet .

      eg:obs1 a qb:Observation ;
        qb:dataSet eg:census-cube .

      # etc
    }
3.1. Catalog
The catalog is a dcat:Catalog and a pmdcat:Catalog. Its description need only include an rdfs:label, which will appear in the main navigation menu and serve as a title on the catalog’s page.
3.2. Catalog Record
The catalog is populated with records bearing the type dcat:CatalogRecord. Records must be attached to a catalog with dcat:record. They ought to be described with the following properties:
foaf:primaryTopic - The URI of the resource being catalogued. There must be only one topic and it must be a dataset (datasets are explained below).
pmdcat:metadataGraph - The graph where the dataset metadata is stored. There must be one metadata graph and it must be an IRI.
The record’s description may also include an rdfs:label. Though not generally displayed in the UI, this provides a label for the catalog record (rather than the dataset).
The record should be described in the metadata graph.
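As a minimal sketch of the catalog and record descriptions above, a labelled catalog with one labelled record might look like the following. The eg: resources and the label text are invented for illustration, and the graph placement is only indicated in comments rather than with named-graph syntax.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix eg:     <http://example.com/> .

# Hypothetical catalog: the label is what would appear in the navigation menu.
eg:datasets-catalog a pmdcat:Catalog, dcat:Catalog ;
  rdfs:label "Datasets" ;
  dcat:record eg:census-catalog-record .

# Hypothetical record, assumed to be stored in eg:census-metadata-graph.
eg:census-catalog-record a dcat:CatalogRecord ;
  rdfs:label "Census catalog record" ;              # optional label for the record itself
  foaf:primaryTopic eg:census-dataset ;             # the one topic, which must be a dataset
  pmdcat:metadataGraph eg:census-metadata-graph .   # the one metadata graph IRI
```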
3.3. Dataset
The dataset has the type pmdcat:Dataset. It ought to be described with the following properties:
dcterms:title - This is used to label the dataset anywhere it appears (e.g. in the catalog and on the dataset’s page). A dataset must have a title. Ideally the title should have a language tag too.
rdfs:comment - A short comment, appearing in the snippets in the catalog. Should be plain text.
pmdcat:graph - One or more graph(s) that store the dataset contents (e.g. the cube or vocabulary). At least one graph must be specified and they must be IRI(s).
dcterms:license - A string literal or preferably a URI like the Open Government Licence. A dataset may have at most one license.
dcterms:publisher - A URI for an organisation, typically taken from the gov.uk organisation register. The organisation resource ought to have an rdfs:label. A dataset may have at most one publisher.
It may also be described with the following properties:
pmdcat:datasetContents - The URI representing the dataset contents if a root resource exists. For RDF data cubes, this is the cube URI. For a concept scheme it would be the concept scheme URI.
dcat:theme - This is used to filter datasets in the catalog.
dcat:keyword - Zero or more string values.
pmdcat:markdownDescription - A longer description of the dataset for the About tab. Can contain markdown to support formatting. If so, should have a datatype of markdown:Resource. A dataset may have at most one markdown description.
dcterms:creator - A URI of an organisation. A dataset may have at most one creator.
dcterms:contributor - A URI of an organisation. A dataset may have more than one contributor.
dcat:contactPoint - Typically a mailto link. Must be an IRI. A dataset may have at most one contact point.
dcat:landingPage - The URL of an external page with more info about this data.
void:sparqlEndpoint - The URL of the site’s SPARQL endpoint.
dcterms:issued and dcterms:modified - Referring to the dataset (contents) itself (not the catalog record). An issued date must be provided. PMD keeps track of the date when the dataset contents (in the pmdcat:graph graph) are modified so you don’t need to provide this. PMD may optionally be configured to instead display values you provide, in which case a dataset may have at most one modified date.
User research indicated that it’s important to know the upstream dates. [Issue #Swirrl/cogs-issues#35]
The dataset should be described in the metadata graph, while its contents should be in the data graph(s). These graphs ought to be different.
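To illustrate the properties listed above, here is a hedged sketch of a dataset description as it might appear in the metadata graph. Every eg: resource, the literal text and the mailto address are invented. The use of xsd:date for dcterms:issued is an assumption, since this profile does not state which datatype PMD expects.

```turtle
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix pmdcat:  <http://publishmydata.com/pmdcat#> .
@prefix eg:      <http://example.com/> .

eg:census-dataset a pmdcat:Dataset ;
  # Properties the dataset ought to have
  dcterms:title "Census population estimates"@en ;
  rdfs:comment "Population counts by area and year." ;
  pmdcat:graph eg:census-data-graph ;
  dcterms:license <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ;
  dcterms:publisher eg:statistics-office ;
  dcterms:issued "2021-06-01"^^xsd:date ;   # datatype is an assumption
  # Optional properties
  pmdcat:datasetContents eg:census-cube ;
  dcat:keyword "census", "population" ;
  dcat:contactPoint <mailto:statistics@example.com> ;
  dcat:landingPage <http://example.com/census> .

# Hypothetical publisher; the organisation ought to carry a label.
eg:statistics-office rdfs:label "Example Statistics Office" .
```

Note that the data graph (eg:census-data-graph) is a different IRI from the metadata graph that would hold these triples.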
There may also be additional metadata properties configured for your PMD instance.
How to document configurable metadata publicly? The trade-metadata.ttl is private (and mightn’t even be kept up-to-date). Other fields like pmdcat:markdownDescription are configurable (could be dcterms:description on another instance).
Explain more about license options? ODRS/CC
3.4. Dataset Contents
The dataset contents should be an instance of pmdcat:DatasetContents. It may also have an additional type taken from one of the sub-classes of pmdcat:DatasetContents. The sub-classes are used to determine which user interface features to offer for the resource. The following types are supported:
pmdcat:DataCube - For qb:DataSet resources from the RDF Data Cube standard [VOCAB-DATA-CUBE]. See the Data Cube section for more details.
pmdcat:ConceptScheme - For skos:ConceptScheme resources from the SKOS standard [skos-reference]. See the Codelists section for more details.
pmdcat:Ontology - For owl:Ontology resources from the OWL standard [owl-ref].
pmdcat:GraphDatasetContents - For arbitrary graphs of RDF, with no obvious root node.
Arbitrary RDF graphs don’t need a :dataset pmdcat:datasetContents :contents or :graph a pmdcat:GraphDatasetContents triple. Instead it’s sufficient to provide :dataset pmdcat:graph :graph.
If a pmdcat:datasetContents is specified there may be only one contents and it must be one of these supported types.
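The two cases described above can be sketched side by side: a dataset whose contents have a root resource (here a concept scheme), and a dataset of arbitrary RDF that only names its graph. All eg: resources are hypothetical.

```turtle
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix eg:     <http://example.com/> .

# Case 1: the contents have a root resource, so it is named and typed.
eg:geographies-dataset a pmdcat:Dataset ;
  pmdcat:datasetContents eg:geographies ;   # only one contents resource
  pmdcat:graph eg:geographies-graph .

eg:geographies a pmdcat:DatasetContents, pmdcat:ConceptScheme, skos:ConceptScheme .

# Case 2: arbitrary RDF with no obvious root node; the graph alone is sufficient.
eg:misc-dataset a pmdcat:Dataset ;
  pmdcat:graph eg:misc-graph .
```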
3.4.1. Self-reference
It’s also possible to use one resource that is both the dataset and its contents:

    eg:self-referencing-dataset a pmdcat:Dataset, pmdcat:DataCube, qb:DataSet ;
      pmdcat:datasetContents eg:self-referencing-dataset .

This simpler form will not be appropriate if you’re re-publishing an existing RDF dataset which includes metadata (e.g. dcterms:modified) that you don’t want to present in the catalog.
4. Data Cube
Statistics are the most common type of data published with PMD. We adopt the RDF Data Cube standard [VOCAB-DATA-CUBE].
We use most of the classes as per the spec: qb:DataSet, qb:DataStructureDefinition, qb:ComponentSpecification, qb:ComponentProperty (i.e. qb:DimensionProperty, qb:AttributeProperty, qb:MeasureProperty), qb:CodedProperty and of course qb:Observation.
We provide a variation to the integrity constraints declared in the vocabulary. In general these are designed to follow the intent of the original but offer conveniences to:
- identify violations (i.e. SELECT ?example) rather than simply detecting them (i.e. ASK)
- split some conditions so that distinct causes may be distinguished
- obviate the need for reasoning by specifying entailments with query patterns.
In the case of IC-15, however, we adopt a different model. The specification requires that all observations provide a measure value. We permit that a data marker may instead be provided in place of a value as explained below. We also extend IC-16 to prevent multiple values for any given measure.
Furthermore we introduce some additional requirements. Dimensions must be labelled. If they provide a qb:codeList then it must be a scheme (i.e. a skos:ConceptScheme containing skos:Concept resources); see the codelists section for more details.
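A minimal sketch of a dimension that meets these additional requirements: it carries a label and its qb:codeList points at a concept scheme containing concepts. The eg: names and labels are illustrative only.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:   <http://example.com/> .

# Hypothetical dimension: labelled, with a codelist.
eg:area a qb:DimensionProperty, qb:CodedProperty ;
  rdfs:label "Area" ;
  qb:codeList eg:geographies .

# The codelist is a concept scheme...
eg:geographies a skos:ConceptScheme ;
  rdfs:label "Geographies" .

# ...whose members are concepts.
eg:uk a skos:Concept ;
  rdfs:label "United Kingdom" ;
  skos:inScheme eg:geographies .
```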
PMD4 doesn’t currently use qb:Slice, qb:SliceKey, or qb:ObservationGroup.
Explain ideas around table/ time-series/ map slices?
4.1. SDMX
Following the RDF Data Cube standard, we recommend that Statistical Data and Metadata eXchange dimensions be used where possible. The standard provides an RDF vocabulary for SDMX 2009 Dimensions. The most commonly used dimensions are sdmxd:refArea and sdmxd:refPeriod.
4.2. Multiple measures
Generally we adopt the Measure Dimension approach as, unlike the Multi-measure observations approach, this allows us to specify measure- and observation-specific attributes.
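As a sketch of the Measure Dimension approach, each observation below carries a single measure, named with qb:measureType from the RDF Data Cube vocabulary, which leaves room for an attribute (here sdmxa:unitMeasure) specific to that measure. The eg: measures, units, dimension values and numbers are all invented for illustration.

```turtle
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmxa: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:    <http://example.com/> .

eg:count a qb:MeasureProperty .
eg:rate  a qb:MeasureProperty .

# One observation per measure; qb:measureType says which measure is present.
eg:obs-count a qb:Observation ;
  qb:dataSet eg:census-cube ;
  sdmxd:refArea eg:uk ;
  sdmxd:refPeriod eg:year-2021 ;
  qb:measureType eg:count ;
  eg:count 123 ;
  sdmxa:unitMeasure eg:people .     # attribute specific to this measure

eg:obs-rate a qb:Observation ;
  qb:dataSet eg:census-cube ;
  sdmxd:refArea eg:uk ;
  sdmxd:refPeriod eg:year-2021 ;
  qb:measureType eg:rate ;
  eg:rate 4.5 ;
  sdmxa:unitMeasure eg:percent .    # a different unit for a different measure
```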
4.3. Attachment levels
We typically attach components to observations and don’t (yet) make use of the normalisation algorithm to push down attachment levels.
Alternate attachment levels [Issue #GSS-Cogs/gss-utils#191]
4.4. Data Markers
Publishers often need some way to mark observations. Sometimes this is an annotation that adds context to an observation; other times the marker replaces the value or explains why it is not there.
The RDF Data Cube model requires that every measure have a value. PMD relaxes this constraint, requiring instead that a value or a data marker be present. This model was chosen to support cases where the value itself was suppressed (e.g. for confidentiality, low reliability or because it was simply unavailable). This allows us to state that explicitly and distinguish the reason. If we instead chose not to create the observation then this could be misconstrued as meaning that the value doesn’t or won’t ever exist. It’s also compatible with existing applications (e.g. SPARQL aggregations) that expect measure values to be numbers.
You may attach a marker to an observation using the property sdmxa:obsStatus. The marker ought to have a label; beyond that PMD has no expectations for its data model. The SDMX Code vocabulary provided as part of the RDF Data Cube specification includes a codelist of sdmxcode:ObsStatus resources. The statistics.gov.scot site provides its own Data markers concept scheme.
    # normal observation provides a measure value
    :obs2015 a qb:Observation ;
      :count 123 .

    # suppressed observation provides a marker instead
    :obs2016 a qb:Observation ;
      sdmxa:obsStatus :c .

    :c a sdmxcode:ObsStatus ;
      rdfs:label "Confidential" .
4.5. Coverage
Need to make progress with the Datasets' Temporal & Spatial Coverage proposal. [Issue #Swirrl/cogs-issues#92]
5. Codelists
This section will need revising in light of the discussions around dataset-specific codelists. [Issue #Swirrl/cogs-issues#263]
Typically the dimension values are enumerated using codelists. We represent these using the Simple Knowledge Organization System [skos-reference].
We use the SKOS descriptions for slicing and filtering cubes. We encourage publishers to provide richer descriptions of their reference data wherever possible.
5.1. SKOS
    eg:obsUK2021 a qb:Observation ;
      sdmxd:refArea :UK ;
      sdmxd:refPeriod :2021 .

    :UK a skos:Concept ;
      rdfs:label "United Kingdom" ;
      skos:inScheme :Geographies ;
      skos:notation "UK" ;
      ui:sortPriority 1 .

    :2021 a skos:Concept ;
      rdfs:label "2021" ;
      skos:inScheme :Years ;
      ui:sortPriority 2021 .

    :Geographies a skos:ConceptScheme ;
      rdfs:label "Geographies" .

    :Years a skos:ConceptScheme ;
      rdfs:label "Years" .
5.1.1. Concept Schemes
Codes are grouped into codelists using skos:ConceptScheme resources.
rdfs:label - A human readable label. A scheme must have one label. This may match the notation if no suitable name applies. Longer descriptions should use rdfs:comment.
Explain that catalog entry is required if we don’t relax this as part of https://github.com/Swirrl/cogs-issues/blob/master/modelling/dataset-specific-codelists.md
If the scheme is hierarchical (it includes skos:broader or skos:narrower relations between concepts) then you must also provide:
skos:hasTopConcept - To identify the top concept or concepts in a scheme (the root node). This must be present for hierarchical schemes.
5.1.2. Concepts
The codes are represented as skos:Concept resources. These must have the following properties:
rdfs:label - A human readable label. A concept must have one label. This may match the notation if no suitable name applies. Longer descriptions should use rdfs:comment.
skos:inScheme - To relate the code to its codelist (concept scheme). A concept must be in at least one scheme.
Note that concepts should be grouped into skos:ConceptScheme resources and not skos:Collection resources, so the relation skos:member must not be present.
Concepts can also be described with some optional properties:
skos:notation - This is the coding that causes us to think of dimension-values as "codes", for example E92000001 is the notation for England. A concept may have at most one notation.
ui:sortPriority - A number used for sorting codes in order. At most one sort priority may be provided.
rdfs:comment - To describe the code in more detail. A concept may have at most one comment.
For hierarchical codelists, individual codes should be related using:
skos:narrower - To point to the URI of a "child" code (perhaps contained within the parent) which must also be a concept.
skos:broader - To point to the URI of a "parent" code which must also be a concept.
These relations form directed-acyclic-graphs. You may additionally provide:
skos:topConceptOf - Pointing to those schemes where the concept is the/a root node. The object must be a scheme.
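Putting the scheme and concept rules together, a small hierarchical codelist might be sketched as follows. The eg: URIs and sort priorities are invented; the notation E92000001 for England is taken from the text above.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ui:   <http://www.w3.org/ns/ui#> .
@prefix eg:   <http://example.com/> .

# A hierarchical scheme must name its top concept(s).
eg:geographies a skos:ConceptScheme ;
  rdfs:label "Geographies" ;
  skos:hasTopConcept eg:uk .

# Root node: labelled, in the scheme, and optionally marked with skos:topConceptOf.
eg:uk a skos:Concept ;
  rdfs:label "United Kingdom" ;
  skos:inScheme eg:geographies ;
  skos:topConceptOf eg:geographies ;
  skos:notation "UK" ;
  ui:sortPriority 1 ;
  skos:narrower eg:england .

# Child node: points back to its parent with skos:broader.
eg:england a skos:Concept ;
  rdfs:label "England" ;
  skos:inScheme eg:geographies ;
  skos:notation "E92000001" ;
  ui:sortPriority 2 ;
  skos:broader eg:uk .
```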
5.2. PMDKOS
We extend the SKOS vocabulary in the PMDKOS vocabulary. The PMDKOS ontology is available here.
5.2.1. Concept levels
PMDKOS introduces a pmdkos:ConceptLevel class. These serve to identify levels in a concept scheme. This helps to find concepts within a hierarchy without having to traverse the skos:broader/skos:narrower relations.
    <http://gss-data.org.uk/def/geography/level/E08> rdfs:subClassOf pmdkos:ConceptLevel ;
      rdfs:label "E08 - Metropolitan Districts" .

    <http://statistics.data.gov.uk/id/statistical-geography/E08000003> a <http://gss-data.org.uk/def/geography/level/E08> ;
      rdfs:label "Manchester" .
The pmdkos:ConceptLevel serves a similar purpose to xkos:ClassificationLevel from the XKOS standard, except that it doesn’t require a single, ordered set of levels. The need arose in the context of English administrative geography where there are parallel branches in the hierarchy with different levels at the same depth (with Unitary Authorities even spanning two equivalent depths) which couldn’t be expressed with xkos:depth or xkos:levels.
6. Reference Data
Reference Data are those resources that are used to classify and describe other data. In our case this is typically the dimension-values of observations in cubes.
While it’s often sufficient to simply enumerate the classifications in codelists, some resources are specialised enough to have their own vocabularies and data sources. Geographies and time intervals are the most common examples used with PMD and are described in more detail below.
6.1. Date and Time
The UK Government reference intervals ontology provides a way to describe time periods (date ranges). The URI patterns are described here.
Where possible we recommend that a named interval (e.g. http://reference.data.gov.uk/id/week/2021-W01) be chosen instead of the equivalent gregorian interval (e.g. http://reference.data.gov.uk/id/gregorian-interval/2021-01-04/P1W). While it’s often easier to create the latter generically, named intervals have more intuitive identifiers and labels.
The descriptions of intervals may be downloaded from http://reference.data.gov.uk. Since time is infinite, it’s not possible to download all definitions at once. Instead we need to load descriptions for new intervals when they’re first used (i.e. as part of the loading pipeline). We can either download the definition or build it ourselves from the URIs (using rules).
The full description provided by http://reference.data.gov.uk is quite verbose. It’s typically only important to include:
rdfs:label - A human-readable name for the interval. Be careful to avoid ambiguity around e.g. years and government years.
scovo:min and scovo:max - These provide a simple translation into an xsd:date literal which allows you to do aggregation/sorting in SPARQL and is more commonly used in other tools (like a JavaScript visualisation library).
interval:hasXsdDurationDescription - The interval’s duration e.g. "P1Y"^^xsd:duration for years.
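A trimmed interval description using only these properties might be sketched as below. The year URI is assumed to follow the same pattern as the week URI shown earlier, and treating scovo:max as the last day inside the interval is also an assumption, so check both against the descriptions served by http://reference.data.gov.uk before relying on them.

```turtle
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix scovo:    <http://purl.org/NET/scovo#> .
@prefix interval: <http://reference.data.gov.uk/def/intervals/> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

# Assumed URI pattern for a calendar year.
<http://reference.data.gov.uk/id/year/2021>
  rdfs:label "Calendar year 2021" ;                  # illustrative wording that avoids confusion with government years
  scovo:min "2021-01-01"^^xsd:date ;                 # first day of the interval
  scovo:max "2021-12-31"^^xsd:date ;                 # assumed inclusive end date
  interval:hasXsdDurationDescription "P1Y"^^xsd:duration .
```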
Explore using interval literals e.g. xsd:gYear as an alternative.
6.2. Geography
6.2.1. ONS Geography Linked Data
For UK geography, publishers should use identifiers from ONS Geography Linked Data.
The URIs have the pattern http://statistics.data.gov.uk/id/statistical-geography/{gss_code} (where the gss_code comes from the ONS’s Code History Database); for example, Manchester has the URI http://statistics.data.gov.uk/id/statistical-geography/E08000003.
The RDF descriptions on the Geography Linked Data site include a range of characteristics such as version history and geometries (with boundaries available in WKT or GeoJSON).
One important property is http://statistics.data.gov.uk/def/statistical-geography#status which is used when browsing geography codelists or using them to filter observations in datasets.
Use pmdkos:validUntil instead of statgeo:status [Issue #Swirrl/muttnik#1254]
6.2.2. Geosparql
The [geosparql] standard provides terms to describe geospatial data in RDF and functions to query it with [SPARQL-QUERY]. This is based upon the Simple Feature Access architecture from the Open Geospatial Consortium. PMD uses this vocabulary to attach boundaries (using geo:Geometry) to geographies (example follows). This allows us to use spatial functions like geof:distance.
6.2.3. PMDGEO
PMD extends geosparql with the PMDGEO vocabulary. The PMDGEO ontology is available here.
The main reason for the extension is to provide support for GeoJSON ([rfc7946]) since the geosparql standard itself only supports Well-known Text (WKT) and Keyhole Markup Language (KML) serialisations. We introduce pmdgeo:asGeoJSON as a sub-property of geo:hasSerialization equivalent to geo:asWKT/geo:asKML and the datatype pmdgeo:geoJsonLiteral as an equivalent to geo:wktLiteral/geo:kmlLiteral.
We also extend the standard with the property pmdgeo:simplificationPercent which you can use to specify the percentage simplification that was applied to the source to derive a geometry (100 would mean no simplification was applied). This provides a simple way to distinguish a precise and generalised geometry for a given spatial feature.
    <http://statistics.data.gov.uk/id/statistical-geography/E00000001> geo:hasGeometry
        <http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/100> ,
        <http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/25> .

    <http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/100> a geo:Geometry ;
      pmdgeo:asGeoJSON "{\"type\":\"Feature\",\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-0.0945,51.51976],[-0.09439,51.52067],[-0.09477,51.52059],[-0.09527,51.5205],[-0.09652,51.52027],[-0.09651,51.52024],[-0.09605,51.52033],[-0.09579,51.52007],[-0.0945,51.51976]]]},\"properties\":{\"geography_uri\":\"http://statistics.data.gov.uk/id/statistical-geography/E00000001\"},\"id\":\"http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/100\"}"^^pmdgeo:geoJsonLiteral ;
      pmdgeo:simplificationPercent 100 ;
      geo:asWKT "POLYGON ((-0.0945 51.51976, -0.09439 51.52067, -0.09477 51.52059, -0.09527 51.5205, -0.09652 51.52027, -0.09651 51.52024, -0.09605 51.52033, -0.09579 51.52007, -0.0945 51.51976))"^^geo:wktLiteral .

    <http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/25> a geo:Geometry ;
      pmdgeo:asGeoJSON "{\"type\":\"Feature\",\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-0.0945,51.51976],[-0.09439,51.52067],[-0.09527,51.5205],[-0.09652,51.52027],[-0.09651,51.52024],[-0.09579,51.52007],[-0.0945,51.51976]]]},\"properties\":{\"geography_uri\":\"http://statistics.data.gov.uk/id/statistical-geography/E00000001\"},\"id\":\"http://statistics.data.gov.uk/id/statistical-geography/E00000001/geometry/25\"}"^^pmdgeo:geoJsonLiteral ;
      pmdgeo:simplificationPercent 25 ;
      geo:asWKT "POLYGON ((-0.0945 51.51976, -0.09439 51.52067, -0.09527 51.5205, -0.09652 51.52027, -0.09651 51.52024, -0.09579 51.52007, -0.0945 51.51976))"^^geo:wktLiteral .
It would be nice to describe the level of generalisation in absolute terms (e.g. metres). [Issue #Swirrl/ons-geo-graft#83]