Show HN: My demo for vector embeddings for the Earth's surface

(louisquissetlabs.com)

106 points | by ckrapu 601 days ago

9 comments

ckrapu 601 days ago
I've had to build out some version of a geospatial vector embedding / latent variable dataset for at least 4 separate projects now. Come see the viewer I've built on top of it!
The embeddings come from globally available Copernicus land cover data.
- fnordpiglet 599 days ago
  Can you explain what I’m looking at? I don’t know how to interpret the hex tiles :-)
  - ckrapu 599 days ago
    Sure! The basic idea is that each hexagon is a discrete unit of space for which I obtain a vector embedding. This vector is supposed to represent a sort of data-based summary of that location, obtained in this case using deep learning.
    When you put the search on a hex, it looks up the vector for that hex and then performs a similarity search on all other vectors within the circle and shows the ones which are most similar in terms of land cover. The dependence on land cover / land use data is just because that was easy to get.
    As other folks have pointed out here, raw satellite imagery is also a potential input source for this. I'm playing around with other sources and really want to integrate something like GeoVex (https://openreview.net/forum?id=7bvWopYY1H) into the embeddings as well.
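A minimal sketch of the lookup-then-rank flow described in this comment. The demo's actual similarity metric and distance filter are not stated in the thread, so cosine similarity, the haversine radius check, and the toy `HEXES` table (IDs, coordinates, 3-dimensional vectors) are all assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed metric, not confirmed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_hexes(query_id, hexes, radius_km, top_k=3):
    """Rank the hexes within radius_km of the query hex by embedding similarity."""
    q = hexes[query_id]
    scored = []
    for hex_id, h in hexes.items():
        if hex_id == query_id:
            continue  # do not return the query hex itself
        if haversine_km(q["lat"], q["lon"], h["lat"], h["lon"]) > radius_km:
            continue  # outside the search circle
        scored.append((cosine_similarity(q["vec"], h["vec"]), hex_id))
    scored.sort(reverse=True)  # most similar first
    return [hex_id for _, hex_id in scored[:top_k]]

# Hypothetical toy data: three hexes near Chicago and one in London.
HEXES = {
    "a": {"lat": 41.88, "lon": -87.63, "vec": [0.9, 0.1, 0.0]},
    "b": {"lat": 41.90, "lon": -87.65, "vec": [0.8, 0.2, 0.1]},
    "c": {"lat": 41.85, "lon": -87.70, "vec": [0.0, 0.1, 0.9]},
    "d": {"lat": 51.51, "lon": -0.13, "vec": [0.9, 0.1, 0.0]},
}

print(similar_hexes("a", HEXES, radius_km=50.0))
```

With a 50 km radius the London hex is filtered out before ranking, even though its vector matches the query exactly. Widening the radius is what would turn this into the global search asked about in the replies.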
    - jebarker 599 days ago
      Would it provide useful/interesting results if the similarity search was global? E.g. find me neighborhoods in London most similar to this one in Chicago?
    - IngWoden 598 days ago
      [dead]
  - wyldfire 599 days ago
    I'm pretty sure I'm not the intended audience but I also have no idea what this is used for. Surveying? Real estate tycoons? Oil & gas exploration?
    - potatoman22 599 days ago
      It's a way to encode land to make predictions of it. E.g. is the land arable, is it rural, how similar is it to X, etc. Embeddings help encode data in formats more usable by ML models.
      - lovasoa 599 days ago
        The question was: in what context do people need to answer a question like "which geographical points are close to X and similar to X"?
        I don't understand who the target audience is and what this can be used for.
        ckrapu 599 days ago
        The original idea came from something I saw at work - we needed a way to build generic feature sets representing something about real estate, but beyond the data we had on prices, floors, and other house-specific details.
        potatoman22 597 days ago
        My guess is this site is simply a way to explore the embeddings. People make similar data visualization tools for word embeddings, so that's what I assumed this was.
      - wyldfire 599 days ago
        Sure, I get that part -- but then how do people use the predictions?
        foota 599 days ago
        The embeddings are generally used by algorithms, not people. You could ask something like "what's the most similar place to X within Y", and it would use the embeddings (which cover a variety of facts) to calculate the answer. An embedding is an N-dimensional vector (where the dimensions may or may not be meaningful to us), and similarity between places can be computed as the similarity between their vectors.
        ckrapu 599 days ago
        Yup, and while the similarity search is perhaps the most visually appealing way to work with it, the real use (in my opinion) is in providing generic sets of geospatial features which are reusable across applications. I've built out versions of H3-referenced feature sets at each of the jobs I've had over the last 10 years.
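One way to picture "reusable across applications" is a lookup table keyed by cell ID that any downstream model can join against. The sketch below is only an illustration: the cell IDs, the embedding values, the listing fields, and the zero-vector fallback are all hypothetical, and no real H3 library call is used.

```python
# Hypothetical embedding table keyed by H3 cell ID (IDs and values are made up).
EMBEDDINGS = {
    "cell_A": [0.12, -0.40, 0.88],
    "cell_B": [0.75, 0.10, -0.22],
}

def add_location_features(records, embeddings, dim=3):
    """Append each record's cell embedding to its record-specific features.

    Records whose cell has no embedding (e.g. a masked-out cell) get zeros,
    which is an assumed fallback rather than anything described in the thread.
    """
    rows = []
    for rec in records:
        vec = embeddings.get(rec["cell"], [0.0] * dim)
        rows.append(rec["features"] + vec)
    return rows

# Hypothetical real-estate records: [price in $1000s, number of floors].
LISTINGS = [
    {"cell": "cell_A", "features": [450.0, 2.0]},
    {"cell": "cell_B", "features": [310.0, 1.0]},
    {"cell": "cell_Z", "features": [275.0, 1.0]},  # no embedding available
]

for row in add_location_features(LISTINGS, EMBEDDINGS):
    print(row)
```

The same table could be joined to crop records or conservation sites without retraining anything, which is the reuse being described.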
  - tartakovsky 599 days ago
    Great question. A legend or brief description of the underlying logic / heuristic would be helpful.
    - breckenedge 599 days ago
      The heuristic is likely the result of an ML algorithm, so the underlying logic may not make much sense to us.
- nonameiguess 598 days ago
  Looks like Copernicus updates yearly? I can't tell if they include elevation from the "technical" tab on their home page.
  Having originally come from the world of geointelligence, let me tell you this is not an easy problem to solve. For rural land use, this is probably fairly reliable, but depending on the granularity of change detection you want, cities are often building new neighborhoods in the span of months, large construction projects finish, human movement happens more in the span of hours or even minutes, and that's just for land. If you want maritime tracking, you need nearly continuous updates. We managed to do it for the Navy, but the infrastructure required for this is immense, much of the sensor technology is classified and not even available for commercial use, and the resource requirements are not remotely practical for a personal side project.
  Of course, military intelligence is primarily trying to track the land use of other militaries, especially in active theaters of operations, and that changes even more frequently than regular places where people aren't constantly erecting and moving temporary headquarters, living under camouflage cover, and blowing up existing infrastructure.
  I guess you're doing this for peacetime domestic real estate, like neighborhood X in city Y is similarity ranked against neighborhood U in city V? Are you incorporating pricing and demographic data or just land use? It seems to me like neighbors make the neighborhood, as much or more than qualities of the land. Along with things like usability of the sidewalks, responsiveness and level of disrepair of the roads, crime rates, level of visible homelessness, air quality, vehicular traffic congestion.
  I don't want to shit on the approach too much. Usefulness is determined by the results you get. But consider the heterogeneity of the data here: some of it ordinal, some of it nominal, discrete versus continuous, irreconcilable in scaling and dimensional analysis, and not necessarily coming from similar distributions if you tried to just z-score it all. I can think of ways using pure numerical voodoo to put them all into the same vector space, but the statistical validity of doing this is dubious at best.
- spousty 599 days ago
  How did you generate the embeddings? The vectors are relatively small compared to the embeddings I have seen built from image and NLP models.
  Which Copernicus bands were you using? Did you augment the data with DEM info?
  - ckrapu 599 days ago
    The embeddings were obtained using a CNN triplet loss model (~10M parameters) on the Copernicus land cover data. I haven't used DEM data yet but I have done generative modeling on DEMs in other work and would like to do that too:
    https://www.linkedin.com/in/christopher-krapu/overlay/157690...
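For readers unfamiliar with the triplet loss mentioned above, here is a minimal sketch of the standard formulation: the loss is zero once the anchor is closer to the positive example than to the negative one by at least a margin. The Euclidean distance, the margin of 1.0, and the toy 2-dimensional vectors are assumptions. The thread does not say how triplets were sampled or which distance the actual model used.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings (made up): the positive sits near the anchor, the negative far away.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

print(triplet_loss(anchor, positive, negative))  # well separated: no penalty
print(triplet_loss(anchor, negative, positive))  # roles swapped: large penalty
```

During training, the CNN producing the embeddings would be updated to push this loss toward zero across many triplets.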
throwaway743 599 days ago
Dude, please provide context on the site. I have no clue what I'm looking at or its purpose. Not trying to poo poo on it, just want context.
- breckenedge 599 days ago
  It’s highlighting similar areas to the area currently under the cursor.
  - ZoomerCretin 599 days ago
    Similar how? Geographically? Climate? Population?
    - breckenedge 598 days ago
      Depends on the data fed to the model. Probably all of the above and many more.
- ckrapu 599 days ago
  Sorry! The presentation could be better. I'll work on the FAQ.
1024core 599 days ago
Moved the center to SF and I've been sitting, watching the spinner.
Some documentation would be helpful.
- crubier 599 days ago
  I've seen the same thing: querying SF hangs for some reason, and so does Cascais in Portugal. It works in San Mateo and Lisbon, though.
  - ckrapu 599 days ago
    I think I can see what's going on here - I used a shapefile with the boundaries of the world's countries (and their coastlines) which had some geometric simplification applied to it. This file was used to mask out any water (for which the embedding model won't do much), but I think that the simplification process snapped the coastline too far inland, leaving some points on land which were masked out erroneously.
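The failure mode described above can be reproduced with a toy example: a point on a narrow peninsula is inside the detailed land polygon but falls outside a simplified version of it, so a land mask built from the simplified outline drops it. The two polygons and the test point below are made up, and the ray-casting test is a generic point-in-polygon check, not the demo's actual masking code.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical coastline: a block of land with a thin peninsula on its east side.
DETAILED = [(0, 0), (4, 0), (4, 2), (6, 2), (6, 3), (4, 3), (4, 5), (0, 5)]
# Simplification removes the peninsula, pulling the coastline inland.
SIMPLIFIED = [(0, 0), (4, 0), (4, 5), (0, 5)]

on_peninsula = (5, 2.5)
print(point_in_polygon(*on_peninsula, DETAILED))    # on land in the detailed outline
print(point_in_polygon(*on_peninsula, SIMPLIFIED))  # masked out as water
```

Hexes that land in the gap between the two outlines would have no embedding, which may be why some coastal queries never return.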
skygazer 599 days ago
This tool looks very interesting, and seems to work well, but being utterly unfamiliar with geospatial vector embeddings, their purpose or use, I had no idea what I was looking at, or why.
It seems to show areas of similarity, within a radius of a central query location, with regard to (perhaps) vegetation cover (e.g., forests, grasslands, wetlands), artificial surfaces (e.g., urban areas, roads), agricultural areas, water bodies, etc., overlaid on Google Maps, and allows exporting of the embeddings for lat/lons as CSV. It looks like land features for hexagonal grid areas have been turned into points in a 15-dimensional space, and some sort of nearest-neighbor search is done to return the most similar other grid areas within the larger area. It does indeed seem accurate in my area!
I'm not sure what this would be useful for, but I'm assuming urban planning, real estate, agriculture or conservation? I know I'm not the target audience, but more info or ideas would be fascinating.
- ckrapu 599 days ago
  You pretty much hit the nail on the head. The application areas you mentioned are the same as the ones that I had in mind when developing this.
  - mistrial9 599 days ago
    this is more like math "eyecandy" in the present state
    source: professional urban planning in California
DerSaidin 599 days ago
Seems to not handle the ocean well.
- spousty 599 days ago
  It's due to the fact that they used satellite imagery to create the embeddings. The map is just for visualization. They probably used 5 or more bands of the satellite data which means each pixel is going to be slightly different due to things like depth, amount of silt in the water, amount of plankton....
  Having worked on these types of problems before, I think the model is doing a pretty great job matching pixels.
  - ckrapu 599 days ago
    Thanks! And you are giving it too much credit here - it's just trained on one-hot encoded land cover (24 classes) from Copernicus. Using imagery directly would be #2 on my list of to-dos after including elevation in the input data.
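A minimal sketch of what "one-hot encoded land cover" means as model input: each pixel's class index becomes a 24-element vector with a single 1. Only the class count of 24 comes from the comment above. The tiny 2x2 patch and the class indices in it are made up, and the real patch size and class ordering are not given in the thread.

```python
NUM_CLASSES = 24  # number of land cover classes, as stated in the comment above

def one_hot(class_index, num_classes=NUM_CLASSES):
    """Return a list of zeros with a single 1 at class_index."""
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class index {class_index} out of range")
    vec = [0] * num_classes
    vec[class_index] = 1
    return vec

def encode_patch(patch):
    """Turn a 2D grid of class indices into a height x width x classes nested list."""
    return [[one_hot(c) for c in row] for row in patch]

# Hypothetical 2x2 land cover patch (class indices are arbitrary).
patch = [[0, 23],
         [5, 5]]
encoded = encode_patch(patch)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

Because the input is already a classified map rather than raw imagery, open water collapses to a single class, which fits the author's remark that the model won't do much there.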
    - spousty 599 days ago
      Oh, so did you run the CNN on pre-classified data?
- ckrapu 599 days ago
  I intentionally avoided using lots of ocean areas - this way I cut down the number of required sites for inference from ~100 million (at resolution 7 in the H3 system) to around 25 million.
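The ~100 million figure can be checked from the H3 grid's structure: resolution 0 has 122 base cells (12 of them pentagons), and each finer resolution subdivides by a factor of 7, giving 2 + 120 * 7^r cells at resolution r. The sketch below applies that formula. The 25 million figure is taken from the comment above as an approximate count.

```python
def h3_cell_count(resolution):
    """Total number of H3 cells covering the globe at a given resolution (0-15)."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions run from 0 to 15")
    return 2 + 120 * 7 ** resolution

total = h3_cell_count(7)
kept = 25_000_000  # approximate number of sites kept, per the comment above

print(total)                   # 98825162
print(round(kept / total, 3))  # 0.253
```

Keeping roughly a quarter of the cells is in line with land making up a bit under 30% of the Earth's surface.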
dlnovell 599 days ago
Chris - just saw your presentation of this at PNNL, awesome seeing it pop up on HN too!
- ckrapu 599 days ago
  Cool! Glad you got to see it working and that presentation was a nice reason to make sure everything was cleaned up.
  - bobeartoes 597 days ago
    Is the presentation of this work available online? Would love to watch or read through!
aaomidi 599 days ago
This is amazing!
watersb 600 days ago
Very nice!