GeoVectors – a Linked Open Corpus of OpenStreetMap Embeddings

The GeoVectors corpus is a large-scale linked open corpus of OpenStreetMap entity embeddings that provides latent representations of over 980 million entities. The embeddings capture the semantic and geographic dimensions of OpenStreetMap entities and make them directly accessible to machine learning applications. The "Tags" datasets provide embeddings that capture the semantic dimension, and the "Location" datasets provide embeddings that capture the geographic dimension.

The embeddings are provided in the tab-separated values (TSV) format. Each row contains the embedding of a single OpenStreetMap entity. The first column contains the OpenStreetMap type and the second column contains the OpenStreetMap ID of the entity. The type is one of node (n), way (w), or relation (r). The remaining columns hold the values of the embedding, one column per dimension.
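
The row layout described above can be illustrated with a minimal Python sketch. It assumes that the files carry no header row and that every embedding value parses as a floating-point number; neither is stated explicitly here. The function name parse_embedding_row and the sample row are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

# Single-letter type codes as described in the format description.
OSM_TYPES = {"n": "node", "w": "way", "r": "relation"}


def parse_embedding_row(line: str) -> Tuple[str, int, List[float]]:
    """Split one TSV row into (type name, OpenStreetMap ID, embedding vector)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected type, ID and at least one embedding dimension")
    osm_type, osm_id, *values = fields
    if osm_type not in OSM_TYPES:
        raise ValueError(f"unknown OpenStreetMap type: {osm_type!r}")
    # Assumption: all remaining columns are numeric embedding values.
    return OSM_TYPES[osm_type], int(osm_id), [float(v) for v in values]


# Made-up row for illustration; real rows have many more dimensions.
example = "n\t20981158\t0.25\t-1.5\t3.0\n"
print(parse_embedding_row(example))  # ('node', 20981158, [0.25, -1.5, 3.0])
```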


DOI: 10.5281/zenodo.4964300 / Dataset VoID description

GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale

Nicolas Tempelmeier, Simon Gottschalk and Elena Demidova. GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale.
30th ACM International Conference on Information and Knowledge Management (CIKM), 2021.

Cite as

@inproceedings{tempelmeier2021geovectors,
   title={{GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale}},
   author={Tempelmeier, Nicolas and Gottschalk, Simon and Demidova, Elena},
   booktitle={Proceedings of the 30th ACM International Conference on Information \& Knowledge Management},
   year={2021}
}

GeoVectors Embeddings

We provide the GeoVectors datasets in region-specific partitions in the tables listed here.

GeoVectors Knowledge Graph

We provide a knowledge graph that links the GeoVectors entities to OpenStreetMap, Wikidata, Wikipedia and DBpedia.

Figure: Schema of the GeoVectors knowledge graph

We provide a query endpoint and a SPARQL endpoint to find the right dataset for particular entities.

Example

The following example shows how to retrieve the geographic GeoVectors embedding of an entity (the city of Oslo) by its Wikidata ID (Q585).

  1. Identify the entity using the SPARQL endpoint.
    SELECT ?entity
    WHERE {
      ?entity owl:sameAs wd:Q585 .
    }
    Alternatively, you can use the query endpoint to search for the label "Oslo".
    The result is the GeoVectors entity "v1_n_20981158".
  2. Identify the dataset DOI, ID and type of the entity using the SPARQL endpoint.
    SELECT (STR(?id) AS ?id) ?type ?doi WHERE {
      geovec:v1_n_20981158 dcterms:isPartOf ?doi .
      geovec:v1_n_20981158 dcterms:identifier ?id .
      geovec:v1_n_20981158 rdf:type ?type .
      FILTER(?type = lgd:Node || ?type = lgd:Way || ?type = lgd:Relation) .
    }
    The result is the dataset DOI 10.5281/zenodo.4957583, which contains geographic embeddings of entities in West Europe, together with the type "Node" and the ID "20981158".
  3. Extract the GeoVectors embeddings.
    Download the gzip-compressed file norway-location.tsv.gz from the identified dataset. Read the file with a CSV parser configured for tab-separated input and find the row that has "n" (type) in its first column and "20981158" (ID) in its second column. The remaining columns of that row form the entity's geographic embedding.
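
Step 3 can be sketched in Python as follows. This is only one possible approach, assuming the downloaded file is a gzip-compressed, UTF-8 encoded TSV file without a header row. The function name find_embedding is hypothetical. The file is streamed row by row so that it never has to be fully loaded into memory.

```python
import csv
import gzip
import os
from typing import List, Optional


def find_embedding(path: str, osm_type: str, osm_id: str) -> Optional[List[float]]:
    """Stream a gzipped TSV file and return the embedding of one entity, or None if absent."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # Tab-delimited input; quoting is disabled because the values are plain tokens.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Column 0 is the type (n, w or r), column 1 is the OpenStreetMap ID.
            if len(row) > 2 and row[0] == osm_type and row[1] == osm_id:
                return [float(value) for value in row[2:]]
    return None


# Runs only if the file from the example has been downloaded to the working directory.
if os.path.exists("norway-location.tsv.gz"):
    vector = find_embedding("norway-location.tsv.gz", "n", "20981158")
    print(None if vector is None else len(vector))
```

The type and the ID are compared as strings, so the ID must be passed exactly as it appears in the second column.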

Funding

This work was partially funded by DFG, German Research Foundation (“WorldKG”, DE 2299/2-1), the Federal Ministry of Education and Research (BMBF), Germany (“Simple-ML”, 01IS18054), the Federal Ministry for Economic Affairs and Energy (BMWi), Germany (“d-E-mand”, 01ME19009B), and the European Commission (EU H2020, “smashHit”, grant-ID 871477).