I've had to build out some version of a geospatial vector embedding / latent variable dataset for at least 4 separate projects now. Come see the viewer I've built on top of it!
The embeddings come from globally available Copernicus land cover data.
Sure! The basic idea is that each hexagon is a discrete unit of space for which I obtain a vector embedding. This vector is supposed to represent a sort of data-based summary of that location, obtained in this case using deep learning.
When you put the search on a hex, it looks up the vector for that hex and then performs a similarity search on all other vectors within the circle and shows the ones which are most similar in terms of land cover. The dependence on land cover / land use data is just because that was easy to get.
As other folks have pointed out here, raw satellite imagery is also a potential input source for this. I'm playing around with other sources and really want to integrate something like GeoVex (https://openreview.net/forum?id=7bvWopYY1H) into the embeddings as well.
Would it provide useful/interesting results if the similarity search was global? E.g. find me neighborhoods in London most similar to this one in Chicago?
It's a way to encode land to make predictions of it. E.g. is the land arable, is it rural, how similar is it to X, etc. Embeddings help encode data in formats more usable by ML models.
The original idea came from something I saw at work - we needed a way to build generic feature sets representing something about real estate, but beyond the data we had on prices, floors, and other house-specific details.
My guess is this site is simply a way to explore the embeddings. People make similar data visualization tools for word embeddings, so that's what I assumed this was.
The embeddings are used by algorithms, not people, generally. You could ask something like "what's the most similar place to X within Y", and it would using the embeddings (which cover a variety of facts) to calculate answer. An embedding is an N dimensional vector (where the dimensions may or may not be meaningful to us), and similarity can be implemented by looking at the similarity between vectors.
Yup, and while the similarity search is perhaps the most visually appealing way to work with it, the real use (in my opinion) is in providing generic sets of geospatial features which are reusable across applications. I've built out versions of H3-referenced feature sets at each of the jobs I've had over the last 10 years.
Looks like Copernicus updates yearly? I can't tell if they include elevation from the "technical" tab on their home page.
Having originally come from the world of geointelligence, let me tell you this is not an easy problem to solve. For rural land use, this is probably fairly reliable, but depending on the granularity of change detection you want, cities are often building new neighborhoods in the span of months, large construction projects finish, human movement happens more in the span of hours or even minutes, and that's just for land. If you want maritime tracking, you need nearly continuous updates. We managed to do it for the Navy, but the infrastructure required for this is immense, much of the sensor technology is classified and not even available for commercial use, and the resource requirements not remotely practical for a personal side project.
Of course, military intelligence is primarily trying to track the land use of other militaries, especially in active theaters of operations, and that changes even more frequently than regular places where people aren't constantly erecting and moving temporary headquarters, living under camouflage cover, and blowing up existing infrastructure.
I guess you're doing this for peacetime domestic real estate, like neighborhood X in city Y is similarity ranked against neighborhood U in city V? Are you incorporating pricing and demographic data or just land use? It seems to me like neighbors make the neighborhood, as much or more than qualities of the land. Along with things like usability of the sidewalks, responsiveness and level of disrepair of the roads, crime rates, level of visible homelessness, air quality, vehicular traffic congestion.
I don't want to shit on the approach too much. Usefulness is determined by the results you get, but given the heterogeneity of the data here, some of it ordinal, some of it nominal, discrete versus continuous, irreconciability of scaling and dimensional analysis, not necessarily coming from similar distributions if you tried to just z-score it all, I can think of ways using pure numerical voodoo to put them all into the same vector space, but the statistical validity of doing this is dubious at best.
The embeddings were obtained using a CNN triplet loss model (~10M parameters) on the Copernicus land cover data. I haven't used DEM data yet but I have done generative modeling on DEMs in other work and would like to do that too:
I think I can see what's going on here - I used a shapefile with the boundaries of the world's countries (and their coastlines) which had some geometric simplification applied to it. This file was used to mask out any water (for which the embedding model won't do much), but I think that the simplification process snapped the coastline too far inland, leaving some points on land which were masked out erroneously.
This tool looks very interesting, and seems to work well, but being utterly unfamiliar with geospatial vector embeddings, their purpose or use, I had no idea what I was looking at, or why.
It seems to show areas of similarity, within a radius of a central query location, with regard to (perhaps) vegetation cover (e.g., forests, grasslands, wetlands), artificial surfaces (e.g., urban areas, roads), agricultural areas, water bodies, etc, overlayed on Google Maps, and allows exporting of the embeddings for lat/lons as cvs. It looks like land features for hexagonal grid areas have been turned into points in a 15 dimensional space, and some sort of nearest-neighbor search is done to return most similar other grid areas within the larger area. It does indeed seem accurate in my area!
I'm not sure what this would be useful for, but I'm assuming urban planning, real estate, agriculture or conservation? I know I'm not the target audience, but more info or ideas would be fascinating.
It's due to the fact that they used satellite imagery to create the embeddings. The map is just for visualization. They probably used 5 or more bands of the satellite data which means each pixel is going to be slightly different due to things like depth, amount of silt in the water, amount of plankton....
Having worked on these types of problems before the model is doing a pretty great job matching pixels.
Thanks! And you are giving it too much credit here - it's just trained on one-hot encoded land cover (24 classes) from Copernicus. Using imagery directly would be # 2 on my list of to-dos after including elevation in the input data.
I intentionally avoided using lots of ocean areas - this way I cut down the number of required sites for inference from ~100 million (at resolution 7 in the H3 system) to around 25 million.
The embeddings come from globally available Copernicus land cover data.
When you put the search on a hex, it looks up the vector for that hex and then performs a similarity search on all other vectors within the circle and shows the ones which are most similar in terms of land cover. The dependence on land cover / land use data is just because that was easy to get.
As other folks have pointed out here, raw satellite imagery is also a potential input source for this. I'm playing around with other sources and really want to integrate something like GeoVex (https://openreview.net/forum?id=7bvWopYY1H) into the embeddings as well.
I don't understand who the target audience is and what this can be used for.
Having originally come from the world of geointelligence, let me tell you this is not an easy problem to solve. For rural land use, this is probably fairly reliable, but depending on the granularity of change detection you want, cities are often building new neighborhoods in the span of months, large construction projects finish, human movement happens more in the span of hours or even minutes, and that's just for land. If you want maritime tracking, you need nearly continuous updates. We managed to do it for the Navy, but the infrastructure required for this is immense, much of the sensor technology is classified and not even available for commercial use, and the resource requirements not remotely practical for a personal side project.
Of course, military intelligence is primarily trying to track the land use of other militaries, especially in active theaters of operations, and that changes even more frequently than regular places where people aren't constantly erecting and moving temporary headquarters, living under camouflage cover, and blowing up existing infrastructure.
I guess you're doing this for peacetime domestic real estate, like neighborhood X in city Y is similarity ranked against neighborhood U in city V? Are you incorporating pricing and demographic data or just land use? It seems to me like neighbors make the neighborhood, as much or more than qualities of the land. Along with things like usability of the sidewalks, responsiveness and level of disrepair of the roads, crime rates, level of visible homelessness, air quality, vehicular traffic congestion.
I don't want to shit on the approach too much. Usefulness is determined by the results you get, but given the heterogeneity of the data here, some of it ordinal, some of it nominal, discrete versus continuous, irreconciability of scaling and dimensional analysis, not necessarily coming from similar distributions if you tried to just z-score it all, I can think of ways using pure numerical voodoo to put them all into the same vector space, but the statistical validity of doing this is dubious at best.
Which copernicus bands were you using? Did you augment the data with DEM info?
https://www.linkedin.com/in/christopher-krapu/overlay/157690...
Some documentation would be helpful.
It seems to show areas of similarity, within a radius of a central query location, with regard to (perhaps) vegetation cover (e.g., forests, grasslands, wetlands), artificial surfaces (e.g., urban areas, roads), agricultural areas, water bodies, etc, overlayed on Google Maps, and allows exporting of the embeddings for lat/lons as cvs. It looks like land features for hexagonal grid areas have been turned into points in a 15 dimensional space, and some sort of nearest-neighbor search is done to return most similar other grid areas within the larger area. It does indeed seem accurate in my area!
I'm not sure what this would be useful for, but I'm assuming urban planning, real estate, agriculture or conservation? I know I'm not the target audience, but more info or ideas would be fascinating.
source: professional urban planning in California
Having worked on these types of problems before the model is doing a pretty great job matching pixels.