Zero-1-to-3: Zero-shot One Image to 3D Object

(cs.columbia.edu)

530 points | by GaggiX 401 days ago

23 comments

  • HopenHeyHi 401 days ago
    3D reconstruction from a single image. They stress the examples are not curated, appears to... well, gosh darnit, it appears to work.

    If it runs fast enough I wonder whether one could just drive around with a webcam and generate these 3d models on the fly and even import them into a sort of GTA type simulation/game engine in real time. (To generate a novel view, Zero-1-to-3 takes only 2 seconds on an RTX A6000 GPU)

      This research is based on work partially supported by: 
      
      - Toyota Research Institute
      - DARPA MCS program under Federal Agreement No. N660011924032
      - NSF NRI Award #1925157
    
    Oh, huh. Interesting.

      Future Work
    
      From objects to scenes: 
      Generalization to scenes with complex backgrounds remains an important challenge for our method.
      
      From scenes to videos: 
      Being able to reason about geometry of dynamic scenes from a single view would open novel research directions -- 
      such as understanding occlusions and dynamic object manipulation.
    
      A few approaches for diffusion-based video generation have been proposed recently and extending them to 3D would be key to opening up these opportunities.
    • TylerE 401 days ago
      Seems like there is a bit of a gap between “runs at 0.5 fps on a $7000 workstation-grade GPU with 48GB of VRAM” and consumer applications.

      With the fairly shallow slope of the GPU performance curve over time, I don’t see them just Moore's-Lawing out of it either. This would need two, maybe three orders of magnitude more performance.
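
      A rough sanity check on that gap, as a minimal sketch (the 0.5 fps comes from the ~2 s per view quoted above; the 30-60 fps interactive targets are my assumption):

          current_fps = 0.5  # roughly 2 seconds per novel view on an RTX A6000
          for target_fps in (30, 60):
              print(f"{target_fps} fps needs ~{target_fps / current_fps:.0f}x more throughput")
          # ~60-120x for the frame rate alone, before the workstation-to-consumer hardware gap,
          # which is how you end up in two-to-three-orders-of-magnitude territory.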

      • HopenHeyHi 401 days ago
        Of course there is a gap. This is at the exploratory proof of concept stage. The fact that it works at all is what is interesting.

        Furthermore, once you've identified the make and model of the car, its relative position in 3D, any anomalies -- that ain't just a Ford pickup, it is loaded with cargo that overhangs in a particular way -- its velocity, etc. -- I'm quite sure that extrapolating additional information from the subsequent frames will be significantly cheaper, as you don't have to generate a 3D model from scratch each time.

        I think this is a viable exploratory path forward.

            Make it work <- you are here
            Make it work correctly
            Make it work fast
        
        Edit: Scotty does know ;)
        • scotty79 401 days ago
          I prefer:

              Make it work <- you are here
              Make it work correctly
              Make it work fast
      • jiggawatts 401 days ago
        Computer power goes up exponentially thanks to Moore's law. Sprinkle some software optimisations on top, and it's conceivable for that to be running at interactive framerates on consumer GPUs within 5-10 years.
        • nitwit005 400 days ago
          > Computer power goes up exponentially thanks to Moore's law

          If you look at a graph, that stopped being true well over a decade ago.

          • jiggawatts 400 days ago
            Only for single-threaded programs. Multi-threaded performance continued along the curve unabated.
            • nitwit005 400 days ago
              Moore's law is specifically about the number (density) of transistors in an integrated circuit.

              You could always get as much parallelism as you wanted by adding more chips.

        • ffitch 401 days ago
          the processing may as well shift to the clouds. With the subscription fee, of course : )
          • TylerE 401 days ago
            Until we break the speed of light, I’m very bearish on cloud gaming. It just feels so bad. You’ve got like 9 layers of latency between you and the screen.
            • jimmySixDOF 401 days ago
              One possible definition of Edge Compute is GPU capacity at every last mile POP
              • jlokier 401 days ago
                I agree, though my last mile latency to the nearest POP is about 85ms. Still a bit on the high side for action games compared with playing locally.
                • kijiki 401 days ago
                  85ms, holy crap, who is your ISP?

                  On Sonic fiber internet in San Francisco, I get 1.5ms to the POP. It is only 4.5ms to my VM in Hurricane Electric's Fremont DC.

            • fooker 401 days ago
              You don’t have to break the speed of light, just have the ping below human perception.

              ~20ms is that threshold, but even 40ms latency is barely noticeable for single player games.

              • enlyth 401 days ago
                It's quite noticeable actually, and it adds up, it's not just an extra 20ms.

                For casual gamers and turn based games maybe it could work, as a niche. For FPS, multiplayer, ARPG, and so on, it's a dealbreaker, anything over 100ms feels too sluggish.

                We should be happy we have so much autonomy with our own hardware, I don't want some big cloud company to be able to tell me what I can play and render, unless we want the "you will own nothing and be happy" meme to become reality.

                • TylerE 401 days ago
                  Actually, in my testing, JRPGs/other turn-based games were amongst the worst because there is so much “management” (inventory, loot, gear, etc.) and the extra lag really throws you off.
              • TylerE 401 days ago
                A wireless controller ALONE is already over 20ms, and that’s before you touch the network, actually do anything with that input, wait for the display to redraw…

                At a 20ms total round trip, that only buys you about a 1500 mile radius, again completely ignoring all other latencies.
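
                A back-of-the-envelope version of that radius figure, assuming the usual ~2/3 c speed of light in fiber, lands in the same ballpark:

                    KM_PER_MS_IN_FIBER = 200                               # light in fiber covers ~200 km per millisecond
                    round_trip_ms = 20
                    radius_km = (round_trip_ms / 2) * KM_PER_MS_IN_FIBER   # ~2000 km one way
                    print(radius_km, radius_km * 0.621, "miles")           # ~1250 miles, before any other latencies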

      • frozenport 401 days ago
        >> fairly shallow slope of the GPU performance curve overtime

        Not true.

  • nico 401 days ago
    This is starting to feel almost like thought-to-launch.

    In the last week, a lot of the ideas I’ve read about in HN comments have then shown up as full-blown projects on the front page.

    As if people are building at an insane speed from idea to launch/release.

    • intelVISA 401 days ago
      GPT4 + Python... the product basically writes itself!

      Until the oceans boil...

      • robertlagrant 401 days ago
        ChatGPT-5 will be written by ChatGPT4? :)
        • knodi123 401 days ago
          if I've been reading it correctly, the power of chatgpt is in the training and data, not necessarily the algorithm.

          And I'm not sure if it's technically possible for one AI to train another AI with the same algorithm and have better performance. Although I could be wrong about any and everything. :-)

          • visarga 401 days ago
            An LLM by itself could generate data, code, and iterate on its training process, thus it can create another LLM from scratch. There is a path to improve LLMs without organic text: connect them to real systems and allow them feedback. They can learn from feedback on their actions. It could be as simple as a Python execution environment, a game, a simulator, other chat bots, or a more complex system like real-world tests.
          • BizarroLand 401 days ago
            I know that NVidia is using AI that is running on NVidia chips to create new chips that they then run AI on.

            All you have left to do is to AI the process of training AI, kind of like building a lathe by hand makes a so-so lathe but that so-so lathe can then be used to build a better and more accurate lathe.

            • digdugdirk 400 days ago
              I actually love this analogy. People tend to not appreciate just how precise modern manufacturing equipment is.

              All of that modern machinery was essentially bootstrapped off a couple of relatively flat rocks. It's going to be interesting to see where this LLM stuff goes when the feedback loop is this quick and so much brainpower is focused on it.

              One of my sneaking suspicions is that Facebook/Google/Amazon/Microsoft/etc. would have been better off keeping employees on the books, if for no other reason than keeping thousands of skilled developers occupied, rather than cutting loose thousands of people during a time of rapid technological progress who now have an axe to grind.

              • scyzoryk_xyz 400 days ago
                It is a nice analogy because you can really extend it to the whole history of technological progress. Tools help make tools - all the way back to obsidian daggers and sticks.
                • qikInNdOutReply 400 days ago
                  Same goes for the belly button, that navel connecting one living being to another, back to the first mammal.
        • kindofabigdeal 401 days ago
          Doubt
      • junon 401 days ago
        I know this is a joke but electronics cause an unmeasurably small amount of heat dissipation. It's how we generate power that's the problem.
        • taneq 401 days ago
          Or what answers we ask the electronics for... "Univac, how do I increase entropy?" distant rumble of cooling fans
          • arthurcolle 401 days ago
            You mean decrease entropy?
            • taneq 400 days ago
              We'll work up to that. For now, there's insufficient data for meaningful answer.
    • nmfisher 401 days ago
      Just yesterday I was literally musing to myself "I wonder if NeRFs would help with 3D object synthesis", and here we are.

      It's definitely a fun time to be involved.

      • popinman322 401 days ago
        NeRFs are a form of inverse renderer; this paper uses Score Jacobian Chaining[0] instead. Model reconstruction from NeRFs is also an active area of research. Check out the "Model Reconstruction" section of Awesome NeRF[1].

        From the SJC paper:

        > We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as function f with parameters θ, i.e., x = f (θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on image x into a gradient on the parameter θ.

        > Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.

        Interpretation: they take multiple views, then optimize the parameters (a voxel grid in this case) of a differentiable renderer (the volume rendering function for voxels) so that its renderings reproduce those views.

        [0]: https://pals.ttic.edu/p/score-jacobian-chaining [1]: https://github.com/awesome-NeRF/awesome-NeRF
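
        A toy sketch of that loop, under heavy simplifications: PyTorch, a voxel grid "rendered" by summing density along an axis, and a plain MSE loss against fixed images standing in for the diffusion-model guidance that SJC actually uses:

            import torch

            theta = torch.zeros(16, 16, 16, requires_grad=True)  # the 3D asset: a voxel density grid

            def render(vox, axis):
                # Crude differentiable "volume rendering": integrate density along one axis.
                # Real methods ray-march with transmittance, but the key property is the same:
                # the 2D image is a differentiable function x = f(theta) of the 3D parameters.
                return vox.sum(dim=axis)

            # Placeholder per-view guidance images; in SJC the image-space gradient
            # comes from a pretrained 2D diffusion model instead.
            targets = [torch.rand(16, 16) for _ in range(3)]

            opt = torch.optim.Adam([theta], lr=0.1)
            for step in range(200):
                opt.zero_grad()
                loss = sum(((render(theta, axis) - t) ** 2).mean() for axis, t in enumerate(targets))
                loss.backward()  # gradients on the 2D images flow through dx/dtheta into the voxels
                opt.step()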

      • regegrt 401 days ago
        It's not based on the NeRF concept though, is it?

        Its outputs can provide the inputs for NeRF training, which is why they mention NeRFs. But it's not NeRF technology.

      • noduerme 401 days ago
        it's actually a really fun time to know how to sculpt in ZBrush and print out models.
        • nmfisher 401 days ago
          If I had any artistic talent whatsoever, I'd probably agree with you!
          • noduerme 401 days ago
            I won't lie... ZBrush is brutally hard. I got a subscription for work and only used it for one paid job, ever. But it's super satisfying if you just want to spend Sunday night making a clay elephant or rhinoceros, and drop $20 to have the file printed out and shipped to you by Thursday. I've fed lots of my sculpture renderings to Dali and gotten some pretty cool 2D results... but nothing nearly as cool as the little asymmetrical epoxy sculptures I can line up on the bookshelf...
    • dimatura 401 days ago
      People are definitely building at a high pace, but for what it's worth, this isn't the first work to tackle this problem, as you can see from the references. The results are impressive though!
    • noduerme 401 days ago
      yeah, the road to hell is paved with a desperate need for upvotes (and angel investment).
    • amelius 401 days ago
      Is image classification at the point yet where you can train it with one or a few examples (plus perhaps some textual explanation)?
      • f38zf5vdt 401 days ago
        Image classification is still a difficult task, especially if there are only a few examples. Training a high-resolution, 1000-class ImageNet classifier on 1M+ images from scratch is a drag involving hundreds or thousands of GPU hours. You can do low-resolution classifiers more easily, but they're less accurate.

        There are tricks to do it faster but they all involve using other vision models that themselves are trained for as long.

        • amelius 401 days ago
          But can't something like GPT help here? For example you show it a picture of a cat, then you say "this is a cat; cats are furry creatures with claws, etc." and then you show it another image and ask if it is also a cat.
          • f38zf5vdt 401 days ago
            You are humanizing token prediction. The multimodal text-vision models were all established using a scaffold of architectures that unified text-token and vision-token similarity, e.g. BLIP-2 [1]. It's possible that a model using unified representations might be able to establish that the set of visual tokens you are searching for corresponds to some set of text tokens, but only if the pretrained weights for the vision encoder are able to extract the features corresponding to the object you are describing to the vision model.

            And the pretrained vision encoder will at some point have been trained to align text and visual tokens (via cosine similarity) on some training set, so it really depends on what exactly that training set had in it.

            [1] https://arxiv.org/pdf/2301.12597.pdf
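
            For what it's worth, "show it a picture and ask if it is a cat" already looks roughly like this with an aligned text-vision encoder (a sketch assuming the Hugging Face transformers CLIP checkpoint; the image path and labels are placeholders). Whether it works hinges, as above, on what the encoder was trained to align:

                import torch
                from PIL import Image
                from transformers import CLIPModel, CLIPProcessor

                model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
                processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

                image = Image.open("mystery.jpg").convert("RGB")
                texts = ["a furry creature with claws, a cat", "a dog", "a car"]

                inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
                with torch.no_grad():
                    out = model(**inputs)
                # logits_per_image holds image-text similarity scores; softmax turns them into a guess.
                print(out.logits_per_image.softmax(dim=-1))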

          • aleph_infinity 401 days ago
            This paper https://cv.cs.columbia.edu/sachit/classviadescr/ (from the same lab as the main post, funnily) does something along those lines with GPT. It shows for things that are easy to describe like Wordle ("tiled letters, some are yellow and green") you can recognize them with zero training. For things that are harder to describe we'll probably need new approaches, but it's an interesting direction.
      • GaggiX 401 days ago
        If you have a few examples, you can use an already-trained encoder (like the CLIP image encoder) and train an SVM on the embeddings; no need to train a neural network.
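
        A minimal sketch of that recipe, assuming the Hugging Face transformers CLIP encoder and scikit-learn (file names and labels are placeholders):

            import torch
            from PIL import Image
            from sklearn.svm import SVC
            from transformers import CLIPModel, CLIPProcessor

            model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

            def embed(paths):
                # Embed images with the frozen CLIP image encoder; no network training happens here.
                images = [Image.open(p).convert("RGB") for p in paths]
                inputs = processor(images=images, return_tensors="pt")
                with torch.no_grad():
                    return model.get_image_features(**inputs).numpy()

            # A handful of labeled examples is enough to fit the SVM on the embeddings.
            train_paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
            train_labels = ["cat", "cat", "dog", "dog"]
            clf = SVC(kernel="linear").fit(embed(train_paths), train_labels)

            print(clf.predict(embed(["new_image.jpg"])))
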
    • cainxinth 399 days ago
      The engineers of the future will be poets. -Terence McKenna
  • qikInNdOutReply 401 days ago
    What happens if you build a circle? As in, this creates a 3D object from an image and another AI creates an image from the 3D object?

    https://www.youtube.com/watch?v=zPqJUrfKuqs

    Does it stabilize, or refine prejudices, or go on a fractal journey of errors over the weight landscape?

  • King-Aaron 401 days ago
    That's honestly extremely impressive. I do hope that the 'in the wild' examples aren't completely curated and are actually being rendered on the fly (they appear to be, but it's hard for me to tell if that's truly the case). Pretty cool to see, however.
    • GaggiX 401 days ago
      >and are actually being rendered on the fly

      They are precomputed: "Note that the demo allows a limited selection of rotation angles quantized by 30 degrees due to limited storage space of the hosting server." But I don't think they are curated; the seeds probably correspond to the seeds of the live demo you can host (they released the code and the models).

  • wslh 401 days ago
    I keep thinking of my project, where we take multiple photos from the same angle with moving lights to rebuild the 3D model. We are not using AI, just optics research like in [1]. We applied that to art at [2].

    [1] Methods for 3D digitization of Cultural Heritage: http://www.ipet.gr/~akoutsou/docs/M3DD.pdf

    [2] https://sublim.art
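
    The fixed-viewpoint, moving-lights setup is classic photometric stereo; a minimal sketch of the core solve (NumPy only, with placeholder light directions and images):

        import numpy as np

        # K images from a single viewpoint, each lit from a known direction.
        L = np.array([[0.0, 0.0, 1.0],
                      [0.7, 0.0, 0.7],
                      [0.0, 0.7, 0.7]])    # (K, 3) light directions
        I = np.random.rand(3, 32, 32)      # (K, H, W) observed intensities (placeholder data)

        K, H, W = I.shape
        pixels = I.reshape(K, -1)
        # Lambertian model: intensity = L @ (albedo * normal); solve per pixel by least squares.
        G, *_ = np.linalg.lstsq(L, pixels, rcond=None)   # (3, H*W)
        albedo = np.linalg.norm(G, axis=0)
        normals = (G / np.maximum(albedo, 1e-8)).reshape(3, H, W)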

    • bogwog 401 days ago
      So the business model there is: scanner + paper shredder + NFT = $$$?

      How many people have taken you up on that offer? Unless it's a shitty/low-effort painting, it seems insane to me that anyone would destroy their artwork in exchange for an NFT of that same artwork.

      • wslh 401 days ago
        What is insane for you could be completely different for others: we were at the last Miami Art Week and Art Basel, and we didn't have enough time for the number of artists who wanted to be part of the process. Will expand more later (working now), but you can see AP coverage here [1].

        It is also important to highlight that we are doing this project at our own risk, with our own money; we have built the hardware and software and are not charging artists for the process. The primary market sale is simply split 85% for artists and the rest for the project. Pretty generous in this risky market.

        [1] https://youtu.be/ajDEHSLi0iE

        • bogwog 401 days ago
          > we have been in the last Miami Art Week and Art Basel and we don't have enough time for the number of artists that wanted to be in the process. Will expand more later

          Please also include the number of those people who actually understand what an NFT is. As a native Miamian, I can guarantee you not a single one does. This city has always been a magnet for the get rich quick scheme types, and crypto is a good match for that because it's harder for a layman to grasp the scam part.

          • wslh 400 days ago
            We should start by talking about what an NFT and a POAP mean to you. What we do is a new concept where you can prove the physical object does not exist anymore and is now digital. The NFT is part of this experiment. It is an experiment for us and for the artists.
        • tough 401 days ago
          It's Banksy as a Service
    • JeremyBanks 399 days ago
      [dead]
  • echelon 401 days ago
    Super cool results.

    This is what my startup is getting into. So I'm very interested.

    These aren't "game ready" - the sculpts are pretty gross. But we're clearly getting somewhere. It's only going to keep getting better.

    I expect we'll be building all new kinds of game engines, render pipelines, and 3D animation tools shortly.

    • redox99 401 days ago
      While this is cool, this is not meant to target "game ready". For games and CGI, there's no reason to limit yourself to a single image. Photogrammetry is already extensively used, and it involves using tens or hundreds of images of the object to scan. Using many images as an input will obviously always be superior to a single one, as a single image means it has to literally make up the back side, and it has no parallax information.
      • oefrha 401 days ago
        You appear to be thinking about scanning a physical object, whereas zero-shot one image to 3D object would be vastly more useful with a single (possibly AI-generated or AI-assisted) illustration. You get a 3D model in seconds at essentially zero cost, can iterate hundreds of times in a single day.
        • redox99 401 days ago
          I agree that for stylized, painting-like 3D models it could be very cool. I was indeed thinking of the typical pipeline for a photorealistic asset.
      • digilypse 401 days ago
        What if I have a dynamically generated character description in my game’s world, generate a portrait for them using StableDiffusion and then turn that into a 3d model that can be posed and re-used?
      • flangola7 401 days ago
        This has DARPA and NSF behind it.

        They're not building this for games; they're building it for autonomous weapons.

    • bredren 401 days ago
      How do these kinds of tools complement actual 3d scanning?

      For example, Apple supposedly has put some time into 3d asset building (presumably in support of AR world building content).

      Can these inference techniques stack or otherwise help more detailed object data collection?

    • regularfry 401 days ago
      I'd be interested in zero-shot two images to 3d object. You can see how a stereo pair ought to improve the amount of information it has available.
    • nico 401 days ago
      And 3D printing. So quickly building physical tools too.
      • skybrian 401 days ago
        For printing parts, precision matters since they likely need to fit with something else. You’ll want to be able to edit dimensions on the model to get the fit right.

        So maybe someday, but I think it would have to be a project that targets CAD.

  • jonplackett 401 days ago
    Would it be useful for a robot / car trying to navigate to be able to do this?
    • eternalban 401 days ago
      Great idea. Processing latency may be an issue. It has to be fast, small, and energy efficient.
    • elif 401 days ago
      Unlikely. The front bumper of a car you are following has zero value for your ego vehicle's safety. Most of the optimization of FSD is in removing extra data to improve the latency of the mapping loop.
      • jonplackett 400 days ago
        But it seems like the main problem for self-driving is accurately understanding the world around the car. The actual driving is pretty easy. Being able to see a partial object and understand what the rest of it is is very useful to human drivers.
  • throwaway4aday 401 days ago
    If you can produce any view angle you want of an object then can't you use photogrammetry to construct a 3D object?
    • nwoli 401 days ago
      See the “Single-View 3D Reconstruction” section at the bottom where they do precisely that
  • lofatdairy 401 days ago
    This is insanely impressive, looking at the 3D reconstruction results. If I'm not mistaken, occlusions are where a lot of attention is being placed in pose estimation problems, and if there are enough annotated environmental spaces to create ground truths, you could probably add environment reconstruction to pose reconstruction. What's nice there is that if you have multiple angles of an environment from a moving camera in a video, you can treat each previous frame as a prior, which helps with prediction time and accuracy.
  • hiccuphippo 401 days ago
    Can you obtain the 3d object from this or only an image with the new perspective? This could revolutionize indie gamedev.
    • jxf 401 days ago
      You can obtain a 3D object, but it's more useful for the novel views than the object, because the object isn't very good and probably needs some processing. See the bottom of the paper.
  • mitthrowaway2 401 days ago
    > We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction.

    Here they list "GT Mesh", "Ours", "Point-E", and "MCC". Does anyone know what technique "GT mesh" refers to? Is it simply the original mesh that generated the source image?

    • GaggiX 401 days ago
      "Ground Truth", the actual mesh
    • haykmartiros 401 days ago
      Ground truth
      • EGreg 401 days ago
        Well honestly the "Ground truth" algorithm seems a lot superior to their method, it has higher fidelity in ALL the examples
        • chaboud 401 days ago
          I read that with the sarcasm that I hope was intended and had a good laugh.
        • razemio 401 days ago
          Haha, I am sorry, I spat my coffee reading this. It is ofc totally OK to not know what ground truth means, but the irony was too funny. Yes, ground truth will always be superior compared to anything else :)!
          • yorwba 401 days ago
            Ground truth will always be superior on the "does this match the ground truth?" metric, but that's often just a proxy for output quality and the model will be judged differently once deployed (e.g. "do human users like this?")

            That's something to be aware of, especially when you're using convenience data of unknown quality to evaluate your model – many research datasets scraped off the internet with little curation and labeled in a rush by low-paid workers contain a lot of SEO garbage and labeling errors.

          • EGreg 400 days ago
            I always wanted to meet the team behind Ground Truth. It’s truly remarkable what they have built. Every time AI models show up, these guys outperform them on every metric.

            Anyone have any contacts? They seem to be extremely elusive

        • sophiebits 401 days ago
          “Ground truth” doesn’t refer to a particular algorithm; it refers to the ideal benchmark of what a perfect performance would look like, which they’re grading against.
        • simlevesque 401 days ago
          Ground truth means that a human person created the model.
          • DarthNebo 401 days ago
            Not necessarily, could also be synthetic. Google did the same for hand poses in BlazePalm
        • Thorrez 401 days ago
          Ground truth means the original model that the image was generated from.
  • hypertexthero 401 days ago
    Brings to mind the Blade Runner enhance scene: https://www.youtube.com/watch?v=hHwjceFcF2Q
  • bmitc 401 days ago
    What if you give it a picture of a cardboard cutout or billboard?
  • brokensegue 401 days ago
    how is this different from the previous NeRF work? does it build a 3D model?
    • GaggiX 401 days ago
      NeRF models are trained on several views with known location and viewing direction. This model takes one image (and you don't need to train a model for each object).
      • amelius 401 days ago
        But if it takes only one image, isn't it likely to hallucinate information?
        • gs17 401 days ago
          Not just likely, it does. Try out the demo and see, e.g. what the backside of their Pikachu toy looks like. Or a little simpler, the paper has an example (the demo also has this) of the back of a car under different seeds.
      • fooker 401 days ago
        Not hotdog.
  • ilaksh 401 days ago
    I wonder if this type of thing could be adapted to a vision system for a robot? So it would locate the camera and reconstruct an entire scene from a series of images as the robot moves around.

    It probably has a ways to go, but being able to do robust SLAM etc. with just a single camera would make things much less expensive.

  • gs17 401 days ago
    For anyone else who tried to download the weights and got Google Drive throwing a quota error at you, they're working on it: https://github.com/cvlab-columbia/zero123/issues/2
  • yawnxyz 401 days ago
    Are there any models that take an image to SVG?
  • hombre_fatal 401 days ago
    Aside, I really like the UI indicators on the draggable models at the bottom that let you know you can rotate them.
  • desmond373 401 days ago
    Would it be possible to generate CAD files with this? As a base for part construction this could be game-changing.
    • gs17 401 days ago
      If you look at the example meshes, it doesn't seem very likely that it would be better than manually creating them, unless you're okay with lumpy parts that aren't exactly the right size. This is too early for it to not require a lot of cleanup to be usable.
      • flangola7 401 days ago
        In other words we just need to wait 6 more months
  • noduerme 401 days ago
    Is there some kind of symmetry at work here in the deductive process?
  • mov 401 days ago
    People plugging it as output of Midjourney in 3, 2, 1...
  • ar9av 401 days ago
    It's hard to tell for certain from the paper without going deep into the code, but it seems they created the new model the same way the depth-conditioned SD models were made, i.e. a normal finetune.

    It might be possible to create a "original view + new angle" conditioned model much more easily by taking the Controlnet/T2IAdapter/GLIDE route where you freeze the original model.
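
    The "freeze the original model, train only the new conditioning branch" pattern that route relies on looks roughly like this in generic PyTorch (a sketch, not the paper's or ControlNet's actual architecture):

        import torch
        from torch import nn

        base_unet = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))  # stand-in for the pretrained SD UNet
        adapter = nn.Linear(64, 64)  # new branch fed by the extra conditioning (e.g. original view + relative angle)

        for p in base_unet.parameters():
            p.requires_grad = False  # the original weights stay frozen, preserving their priors

        optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)  # only the adapter gets trained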

    Text-to-3D seems close to being solved.

    It also makes me think a "original character image + new pose" conditioned model would also work quite well.