The GeoVectors corpus is a comprehensive large-scale linked open corpus of OpenStreetMap entity embeddings that provides latent representations of over 980 million entities. The GeoVectors embeddings capture the semantic and geographic dimensions of OpenStreetMap entities and make them directly accessible to machine learning applications. The "Tags" datasets provide embeddings that capture the semantic dimension of OpenStreetMap entities. The "Location" datasets provide embeddings that capture the geographic dimension.
The embeddings are provided in the tab-separated values (TSV) format. Each row contains the embedding of a single OpenStreetMap entity. The first column contains the OpenStreetMap type and the second column contains the OpenStreetMap ID of the respective entity. The type is one of node (n), way (w), or relation (r). The remaining columns represent the dimensions of the embedding space.
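As an illustration of the row layout described above, the following minimal Python sketch splits one row into its type, ID, and vector. The names `parse_row`, `EmbeddingRow`, and `OSM_TYPES` are hypothetical helpers introduced here, and the sample row uses made-up values with only three dimensions; real rows have as many value columns as the embedding space has dimensions.

```python
from typing import List, NamedTuple

# Single-letter type codes as described in the format above.
OSM_TYPES = {"n": "node", "w": "way", "r": "relation"}


class EmbeddingRow(NamedTuple):
    osm_type: str        # "n", "w", or "r"
    osm_id: int          # OpenStreetMap ID of the entity
    vector: List[float]  # one value per embedding dimension


def parse_row(line: str) -> EmbeddingRow:
    """Parse one tab-separated row into type, ID, and embedding vector."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected type, ID, and at least one dimension")
    osm_type, osm_id = fields[0], fields[1]
    if osm_type not in OSM_TYPES:
        raise ValueError(f"unknown OpenStreetMap type: {osm_type!r}")
    return EmbeddingRow(osm_type, int(osm_id), [float(x) for x in fields[2:]])


# Made-up row for illustration only (not real embedding values).
row = parse_row("n\t20981158\t0.25\t-1.5\t3.0\n")
print(OSM_TYPES[row.osm_type], row.osm_id, row.vector)
```

Keeping the type next to the ID matters because OpenStreetMap nodes, ways, and relations use separate ID spaces, so an ID alone does not identify an entity.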
GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale
Nicolas Tempelmeier, Simon Gottschalk and Elena Demidova. GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale. 30th ACM International Conference on Information and Knowledge Management (CIKM), 2021.
Cite as
title={{GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale}},
author={Tempelmeier, Nicolas and Gottschalk, Simon and Demidova, Elena},
booktitle={Proceedings of the 30th ACM International Conference on Information & Knowledge Management},
year={2021}
}
GeoVectors Embeddings
We provide the GeoVectors datasets in region-specific partitions in the tables listed here.
GeoVectors Knowledge Graph
We provide a knowledge graph that links the GeoVectors entities to OpenStreetMap, Wikidata, Wikipedia and DBpedia.
We provide a query endpoint and a SPARQL endpoint to find the right dataset for particular entities.
Example
The following example shows how to retrieve the geographic GeoVectors embedding vector of an entity (the city Oslo) by its Wikidata ID (Q585).
- Identify the entity using the SPARQL endpoint (results).
SELECT ?entity
WHERE {
  ?entity owl:sameAs wd:Q585 .
}
Alternatively, you can use the query endpoint to search for the label "Oslo".
We take as result the GeoVectors entity "v1_n_20981158".
- Identify the dataset DOI, ID and type of the entity using the SPARQL endpoint (results).
SELECT (STR(?id) AS ?id) ?type ?doi WHERE {
  geovec:v1_n_20981158 dcterms:isPartOf ?doi .
  geovec:v1_n_20981158 dcterms:identifier ?id .
  geovec:v1_n_20981158 rdf:type ?type .
  FILTER(?type = lgd:Node || ?type = lgd:Way || ?type = lgd:Relation) .
}
We take as result the dataset DOI 10.5281/zenodo.4957583 that contains geographic embeddings of entities in West Europe, the type "Node" and the ID "20981158".
- Extract the GeoVectors embeddings.
Download the file norway-location.tsv.gz from the identified dataset. Read that file using a CSV parser and search for the row which has "n" (type) in its first column and "20981158" (ID) in its second column. The remaining columns represent the entity's geographic embedding.
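The extraction step above could be sketched in Python roughly as follows. This is a minimal sketch, not the authors' tooling: `split_entity_name` and `find_embedding` are hypothetical helpers, and the `<version>_<type>_<id>` naming scheme is an assumption inferred from the single example "v1_n_20981158". So that the snippet runs without the download, it searches a tiny stand-in file with made-up values; for real use, pass the path of the downloaded norway-location.tsv.gz instead.

```python
import csv
import gzip
import os
import tempfile
from typing import List, Optional, Tuple


def split_entity_name(name: str) -> Tuple[str, str, str]:
    """Split e.g. 'v1_n_20981158' into ('v1', 'n', '20981158').

    Assumes a <version>_<type>_<id> scheme (inferred from the example).
    """
    version, osm_type, osm_id = name.split("_")
    return version, osm_type, osm_id


def find_embedding(path: str, osm_type: str, osm_id: str) -> Optional[List[float]]:
    """Stream a gzipped TSV file and return the first matching embedding."""
    # Text mode decompresses on the fly, so the file is never fully loaded.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Column 1 is the type, column 2 the ID, the rest is the vector.
            if len(row) >= 2 and row[0] == osm_type and row[1] == osm_id:
                return [float(x) for x in row[2:]]
    return None  # entity not present in this file


# Stand-in file with made-up values, used here instead of the real download.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-location.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as out:
        out.write("w\t20981158\t9.0\t9.0\n")    # same ID, different type
        out.write("n\t20981158\t0.5\t-0.25\n")  # the row we want
    _, osm_type, osm_id = split_entity_name("v1_n_20981158")
    embedding = find_embedding(path, osm_type, osm_id)

print(osm_type, osm_id, embedding)
```

Matching on both columns is deliberate: the stand-in file contains a way with the same ID as the node, and only the row with type "n" is returned.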
Funding
This work was partially funded by DFG, German Research Foundation (“WorldKG”, DE 2299/2-1), the Federal Ministry of Education and Research (BMBF), Germany (“Simple-ML”, 01IS18054), the Federal Ministry for Economic Affairs and Energy (BMWi), Germany (“d-E-mand”, 01ME19009B), and the European Commission (EU H2020, “smashHit”, grant-ID 871477).