Creating LDF Resources

This vignette describes how to create ldf resources from SPARQL queries or RDF files. You might like to first read the Working with LDF Resources vignette to understand how RDF resources are represented in LDF.

library(ldf)

Downloading Resources with SPARQL

You can create resources by downloading a table of descriptions with a SPARQL SELECT query.

As an example, let’s download some music genres from DBpedia. This query finds 100 things that are music genres, each identified by its URI, along with a label and a descriptive comment for each. We’ll ask for the English versions of the latter two strings.

music_genres_query <- "
PREFIX : <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
WHERE {
  ?uri a :MusicGenre;
    rdfs:label ?label;
    rdfs:comment ?comment
    .
  FILTER langMatches(lang(?label), 'EN')
  FILTER langMatches(lang(?comment), 'EN')
} LIMIT 100
"

We can use the query() function to execute the query and parse the results:

music_genre_results <- query(music_genres_query, endpoint="http://dbpedia.org/sparql/")
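To make "executing the query" a little more concrete, here is a minimal sketch of how a SELECT query can be sent to an endpoint under the SPARQL 1.1 Protocol, which accepts the query text in a `query` parameter of a GET request. This is not how ldf's query() is implemented (this vignette does not show that); `build_sparql_url()` is a hypothetical helper written only for illustration, and it builds the request URL without contacting the endpoint.

```r
# Hypothetical helper (not part of ldf): build a SPARQL Protocol GET URL.
build_sparql_url <- function(query, endpoint) {
  # reserved = TRUE percent-encodes spaces, braces, '?' and similar characters
  encoded <- utils::URLencode(query, reserved = TRUE)
  paste0(endpoint, "?query=", encoded)
}

example_query <- "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
request_url <- build_sparql_url(example_query, "http://dbpedia.org/sparql/")
print(request_url)
```

An HTTP client could then fetch `request_url` and parse the response into a table, which is the part that query() takes care of for us.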

This is what the first few results look like:

head(music_genre_results)
#> # A tibble: 6 x 3
#>   uri                       label       comment                                 
#>   <chr>                     <chr>       <chr>                                   
#> 1 http://dbpedia.org/resou… Art rock    "Art rock is a subgenre of rock music t…
#> 2 http://dbpedia.org/resou… Bebop       "Bebop or bop is a style of jazz develo…
#> 3 http://dbpedia.org/resou… Britpop     "Britpop was a UK based music and cultu…
#> 4 http://dbpedia.org/resou… Bubblegum … "Bubblegum pop (also known as bubblegum…
#> 5 http://dbpedia.org/resou… Fighting g… "A fighting game is a video game in whi…
#> 6 http://dbpedia.org/resou… Free impro… "Free improvisation or free music is im…

We can then create resources for these:

music_genres <- resource(music_genre_results$uri, description=music_genre_results)

These can then be manipulated within R:

# find music genres where the description mentions "dance"
music_genres[grep("dance", property(music_genres, "comment"))]
#> <ldf_resource[12]>
#>  [1] Polka                          Trance music                  
#>  [3] Vaudeville                     Zarzuela                      
#>  [5] Afro/Cosmic music              Benga music                   
#>  [7] Bubblegum dance                Waltz (International Standard)
#>  [9] Logobi                         Sega (genre)                  
#> [11] K-pop                          Western swing                 
#> Description: uri, label, comment
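Note that grep() is case sensitive by default, so a comment that starts with "Dance" would not be matched by the pattern above. The sketch below shows the difference on a plain data frame; the three rows are made up for illustration and are not DBpedia results.

```r
# Made-up stand-in for a table of genre descriptions
genres <- data.frame(
  label = c("Waltz", "Bebop", "Trance music"),
  comment = c(
    "The waltz is a ballroom dance.",
    "Bebop is a style of jazz.",
    "Dance music with a repetitive beat."
  ),
  stringsAsFactors = FALSE
)

# Case-sensitive match: only finds lower-case "dance"
case_sensitive <- genres$label[grepl("dance", genres$comment)]

# Case-insensitive match: also finds "Dance" at the start of a sentence
case_insensitive <- genres$label[grepl("dance", genres$comment, ignore.case = TRUE)]

print(case_sensitive)
print(case_insensitive)
```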

Reading RDF files with rdflib

We can create resources from serialised RDF files too.

To read RDF into R we can use the rdflib package. This in turn uses the redland package to provide bindings to the C library of the same name, and the jsonld package for JSON-LD serialisations.

Let’s load up an example from that package.

library(rdflib)

article_rdf <- rdf_parse(system.file("extdata", "ex.xml", package="rdflib", mustWork=TRUE))

This creates a list containing pointers to redland world and model objects. We can take a peek at the statements with rdflib::print.rdf() (this serialises the data again and prints the result):

print(article_rdf, format="turtle")
#> Total of 35 triples, stored in hashes
#> -------------------------------
#> @base <localhost://> .
#> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
#> 
#> <http://id.crossref.org/contributor/benjamin-l-phillips-2etprmps2zm1a>
#>     a <http://xmlns.com/foaf/0.1/Person> ;
#>     <http://xmlns.com/foaf/0.1/familyName> "Phillips" ;
#>     <http://xmlns.com/foaf/0.1/givenName> "Benjamin L." ;
#>     <http://xmlns.com/foaf/0.1/name> "Benjamin L. Phillips" .
#> 
#> <http://id.crossref.org/contributor/carl-boettiger-2etprmps2zm1a>
#> 
#> ... with 25 more triples

The content is too large to display here, but you can see from the RDF file itself that the data describes a journal article: https://doi.org/10.1002/ece3.2314.

The description is a graph, not a table. It’s not a tidy collection of similarly shaped objects. We’ve got the article itself, and nested descriptions of related resources.

We can identify the different resource types with a query:

rdf_query(article_rdf, "SELECT * WHERE { ?s a ?type }")
#> # A tibble: 4 x 2
#>   s                                                  type                       
#>   <chr>                                              <chr>                      
#> 1 http://id.crossref.org/contributor/t-alex-perkins… http://xmlns.com/foaf/0.1/…
#> 2 http://id.crossref.org/contributor/benjamin-l-phi… http://xmlns.com/foaf/0.1/…
#> 3 http://id.crossref.org/contributor/carl-boettiger… http://xmlns.com/foaf/0.1/…
#> 4 http://id.crossref.org/issn/2045-7758              http://purl.org/ontology/b…

Here we can see that the description also includes the article’s creators and the journal in which it was published.

We could gather all of these entities into a single resource vector, but the descriptions wouldn’t overlap. The creators don’t have prism:issn identifiers and the journal doesn’t have a foaf:familyName.
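To see why non-overlapping descriptions are awkward, here is a small sketch that combines two made-up description tables with base R's merge(). The URIs, names and ISSN are placeholders, not values from the example file.

```r
# Made-up descriptions of two kinds of entity
creators <- data.frame(
  uri = c("http://example.org/person/1", "http://example.org/person/2"),
  family_name = c("Smith", "Jones"),
  stringsAsFactors = FALSE
)
journal <- data.frame(
  uri = "http://example.org/journal/1",
  issn = "1234-5678",
  stringsAsFactors = FALSE
)

# A full outer join on the shared "uri" column keeps every entity,
# but each row is missing the properties of the other entity type
combined <- merge(creators, journal, all = TRUE)
print(combined)
```

Every row ends up with an NA in at least one column, which is a sign that the entities belong in separate vectors.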

Instead it makes more sense to split these entities into separate vectors. We’ll focus on the creators, since there are several of them. We can do this with a query:

creators_triples <- rdf_query(article_rdf, "SELECT * WHERE { ?s a <http://xmlns.com/foaf/0.1/Person>; ?p ?o }")

Now we have a table of statements about the creators:

creators_triples
#> # A tibble: 12 x 3
#>    s                                  p                        o                
#>    <chr>                              <chr>                    <chr>            
#>  1 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Boettiger        
#>  2 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Carl             
#>  3 http://id.crossref.org/contributo… http://www.w3.org/1999/… http://xmlns.com…
#>  4 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Carl Boettiger   
#>  5 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Perkins          
#>  6 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… T. Alex Perkins  
#>  7 http://id.crossref.org/contributo… http://www.w3.org/1999/… http://xmlns.com…
#>  8 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… T. Alex          
#>  9 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Benjamin L. Phil…
#> 10 http://id.crossref.org/contributo… http://www.w3.org/1999/… http://xmlns.com…
#> 11 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Benjamin L.      
#> 12 http://id.crossref.org/contributo… http://xmlns.com/foaf/0… Phillips

We can tabulate these statements into a tidy data frame with one row per creator, and one column per property.

library(tidyr)

(creators_description <- creators_triples %>% 
  spread("p","o"))
#> # A tibble: 3 x 5
#>   s        `http://www.w3.or… `http://xmlns.c… `http://xmlns.c… `http://xmlns.c…
#>   <chr>    <chr>              <chr>            <chr>            <chr>           
#> 1 http://… http://xmlns.com/… Phillips         Benjamin L.      Benjamin L. Phi…
#> 2 http://… http://xmlns.com/… Boettiger        Carl             Carl Boettiger  
#> 3 http://… http://xmlns.com/… Perkins          T. Alex          T. Alex Perkins
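The tidyr documentation marks spread() as superseded by pivot_wider(), so the same reshaping can also be sketched with the newer function. The triples below are made up for illustration. Note that pivot_wider() keeps rows and columns in order of first appearance rather than sorting them, so the ordering can differ from the spread() output.

```r
library(tidyr)

# Made-up table of statements: one row per (subject, property, value)
triples <- data.frame(
  s = c("http://example.org/a", "http://example.org/a",
        "http://example.org/b", "http://example.org/b"),
  p = c("given_name", "family_name", "given_name", "family_name"),
  o = c("Ada", "Lovelace", "Alan", "Turing"),
  stringsAsFactors = FALSE
)

# One row per subject, one column per property
wide <- pivot_wider(triples, names_from = p, values_from = o)
print(wide)
```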

We could proceed using the full URIs as properties, but it’s nicer to replace these with shorter strings that don’t need escaping with backticks:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

(creators_description <- creators_description %>% 
  rename(uri=s,
         type=`http://www.w3.org/1999/02/22-rdf-syntax-ns#type`,
         family_name=`http://xmlns.com/foaf/0.1/familyName`,
         given_name=`http://xmlns.com/foaf/0.1/givenName`,
         name=`http://xmlns.com/foaf/0.1/name`))
#> # A tibble: 3 x 5
#>   uri                          type            family_name given_name name      
#>   <chr>                        <chr>           <chr>       <chr>      <chr>     
#> 1 http://id.crossref.org/cont… http://xmlns.c… Phillips    Benjamin … Benjamin …
#> 2 http://id.crossref.org/cont… http://xmlns.c… Boettiger   Carl       Carl Boet…
#> 3 http://id.crossref.org/cont… http://xmlns.c… Perkins     T. Alex    T. Alex P…
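With many properties, renaming each column by hand gets tedious. One possible shortcut, sketched below, derives short names from the last segment of each URI. `short_name()` is a hypothetical helper, not an ldf or rdflib function, and it assumes that the local names do not clash across vocabularies.

```r
# Hypothetical helper: turn a property URI into a snake_case column name
short_name <- function(uri) {
  local <- sub("^.*[/#]", "", uri)                  # keep text after the last '/' or '#'
  snake <- gsub("([a-z])([A-Z])", "\\1_\\2", local) # familyName -> family_Name
  tolower(snake)
}

properties <- c(
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
  "http://xmlns.com/foaf/0.1/familyName",
  "http://xmlns.com/foaf/0.1/givenName",
  "http://xmlns.com/foaf/0.1/name"
)

short <- short_name(properties)
print(short)
```

The result could be assigned to the column names of a description table, although an explicit rename() is safer when two vocabularies use the same local name.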

We could have done the same transformation within the SELECT query:

describe_creator <- "
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE { 
  ?uri a foaf:Person;
    a ?type;
    foaf:familyName ?family_name;
    foaf:givenName ?given_name;
    foaf:name ?name;
    .
}
"

(creators_description <- rdf_query(article_rdf, describe_creator))
#> # A tibble: 3 x 5
#>   uri                          type            family_name given_name name      
#>   <chr>                        <chr>           <chr>       <chr>      <chr>     
#> 1 http://id.crossref.org/cont… http://xmlns.c… Boettiger   Carl       Carl Boet…
#> 2 http://id.crossref.org/cont… http://xmlns.c… Perkins     T. Alex    T. Alex P…
#> 3 http://id.crossref.org/cont… http://xmlns.c… Phillips    Benjamin … Benjamin …

We can then use this to create the resource vector:

(creators <- resource(creators_description$uri, creators_description))
#> <ldf_resource[3]>
#> [1] http://id.crossref.org/contributor/carl-boettiger-2etprmps2zm1a     
#> [2] http://id.crossref.org/contributor/t-alex-perkins-2etprmps2zm1a     
#> [3] http://id.crossref.org/contributor/benjamin-l-phillips-2etprmps2zm1a
#> Description: uri, type, family_name, given_name, name
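A SELECT query returns one row per combination of matching values, so a resource with two values for the same property (two types, say) would appear on two rows. It can be worth checking for this before creating the vector. The sketch below uses a made-up table; whether resource() itself requires unique URIs is an assumption that this vignette does not confirm.

```r
# Made-up description in which one resource has two types
description <- data.frame(
  uri = c("http://example.org/a", "http://example.org/a", "http://example.org/b"),
  type = c("Person", "Agent", "Person"),
  stringsAsFactors = FALSE
)

# URIs that appear on more than one row
duplicated_uris <- unique(description$uri[duplicated(description$uri)])
print(duplicated_uris)

# One simple way to get back to one row per resource: keep the first row for each URI
one_row_each <- description[!duplicated(description$uri), ]
print(one_row_each)
```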