Zero-1-to-3: Zero-shot One Image to 3D Object

(cs.columbia.edu)

530 points | by GaggiX 401 days ago

23 comments

  • HopenHeyHi 401 days ago
    3D reconstruction from a single image. They stress the examples are not curated, appears to... well, gosh darnit, it appears to work.

    If it runs fast enough I wonder whether one could just drive around with a webcam and generate these 3d models on the fly and even import them into a sort of GTA type simulation/game engine in real time. (To generate a novel view, Zero-1-to-3 takes only 2 seconds on an RTX A6000 GPU)

      This research is based on work partially supported by: 
      
      - Toyota Research Institute
      - DARPA MCS program under Federal Agreement No. N660011924032
      - NSF NRI Award #1925157
    
    Oh, huh. Interesting.

      Future Work
    
      From objects to scenes: 
      Generalization to scenes with complex backgrounds remains an important challenge for our method.
      
      From scenes to videos: 
      Being able to reason about geometry of dynamic scenes from a single view would open novel research directions -- 
      such as understanding occlusions and dynamic object manipulation.
    
      A few approaches for diffusion-based video generation have been proposed recently and extending them to 3D would be key to opening up these opportunities.
    • TylerE 401 days ago
      Seems like there is a bit of a gap between “runs at 0.5 fps on a $7000 workstation-grade GPU with 48GB of VRAM” and consumer applications.

      With the fairly shallow slope of the GPU performance curve over time, I don’t see them just Moore's-Lawing out of it either. This would need two, maybe three orders of magnitude more performance.
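
      A rough sanity check on that gap, as a minimal sketch (the 0.5 fps comes from the ~2 s per view quoted above; the 30-60 fps interactive targets are my assumption):

          current_fps = 0.5  # roughly 2 seconds per novel view on an RTX A6000
          for target_fps in (30, 60):
              print(f"{target_fps} fps needs ~{target_fps / current_fps:.0f}x more throughput")
          # ~60-120x for the frame rate alone, before the workstation-to-consumer hardware gap,
          # which is how you end up in two-to-three-orders-of-magnitude territory.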

      • HopenHeyHi 401 days ago
        Of course there is a gap. This is at the exploratory proof of concept stage. The fact that it works at all is what is interesting.

        Furthermore, once you've identified the make and model of the car, its relative position in 3D, any anomalies -- that ain't just a Ford pickup, it is loaded with cargo that overhangs in a particular way -- its velocity, etc. -- I'm quite sure that extrapolating additional information from the subsequent frames will be significantly cheaper, as you don't have to generate a 3D model from scratch each time.

        I think this is a viable exploratory path forward.

            Make it work <- you are here
            Make it work correctly
            Make it work fast
        
        Edit: Scotty does know ;)
        • scotty79 401 days ago
          I prefer:

              Make it work <- you are here
              Make it work correctly
              Make it work fast
      • jiggawatts 401 days ago
        Computer power goes up exponentially thanks to Moore's law. Sprinkle some software optimisations on top, and it's conceivable for that to be running at interactive framerates on consumer GPUs within 5-10 years.
        • nitwit005 400 days ago
          > Computer power goes up exponentially thanks to Moore's law

          If you look at a graph, that stopped being true well over a decade ago.

          • jiggawatts 400 days ago
            Only for single-threaded programs. Multi-threaded performance continued along the curve unabated.
            • nitwit005 400 days ago
              Moore's law is specifically about the number (density) of transistors in an integrated circuit.

              You could always get as much parallelism as you wanted by adding more chips.

        • ffitch 401 days ago
          the processing may as well shift to the clouds. With the subscription fee, of course : )
          • TylerE 401 days ago
            Until we break the speed of light, I’m very bearish on cloud gaming. It just feels so bad. You’ve got like 9 layers of latency between you and the screen.
            • jimmySixDOF 401 days ago
              One possible definition of Edge Compute is GPU capacity at every last mile POP
              • jlokier 401 days ago
                I agree, though my last mile latency to the nearest POP is about 85ms. Still a bit on the high side for action games compared with playing locally.
                • kijiki 401 days ago
                  85ms, holy crap, who is your ISP?

                  On Sonic fiber internet in San Francisco, I get 1.5ms to the POP. It is only 4.5ms to my VM in Hurricane Electric's Fremont DC.

            • fooker 401 days ago
              You don’t have to break the speed of light, just have the ping below human perception.

              ~20ms is that threshold, but even 40ms latency is barely noticeable for single player games.

              • enlyth 401 days ago
                It's quite noticeable actually, and it adds up, it's not just an extra 20ms.

                For casual gamers and turn based games maybe it could work, as a niche. For FPS, multiplayer, ARPG, and so on, it's a dealbreaker, anything over 100ms feels too sluggish.

                We should be happy we have so much autonomy with our own hardware, I don't want some big cloud company to be able to tell me what I can play and render, unless we want the "you will own nothing and be happy" meme to become reality.

                • TylerE 401 days ago
                  Actually, in my testing, JRPGs/other turn-based games were amongst the worst because there is so much “management” (inventory, loot, gear, etc.) and the extra lag really throws you off.
              • TylerE 401 days ago
                A wireless controller ALONE is already over 20ms, and that’s before you touch the network, actually do anything with that input, wait for the display to redraw…

                At a 20ms total round trip, that only buys you about a 1500 mile radius, again completely ignoring all other latencies.
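
                A back-of-the-envelope version of that radius figure, assuming the usual ~2/3 c speed of light in fiber, lands in the same ballpark:

                    KM_PER_MS_IN_FIBER = 200                               # light in fiber covers ~200 km per millisecond
                    round_trip_ms = 20
                    radius_km = (round_trip_ms / 2) * KM_PER_MS_IN_FIBER   # ~2000 km one way
                    print(radius_km, radius_km * 0.621, "miles")           # ~1250 miles, before any other latencies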

      • frozenport 401 days ago
        >> fairly shallow slope of the GPU performance curve overtime

        Not true.

  • nico 401 days ago
    This is starting to feel almost like thought-to-launch.

    In the last week, a lot of the ideas I’ve read about in HN comments have then shown up as full-blown projects on the front page.

    As if people are building at an insane speed from idea to launch/release.

    • intelVISA 401 days ago
      GPT4 + Python... the product basically writes itself!

      Until the oceans boil...

      • robertlagrant 401 days ago
        ChatGPT-5 will be written by ChatGPT4? :)
        • knodi123 401 days ago
          if I've been reading it correctly, the power of chatgpt is in the training and data, not necessarily the algorithm.

          And I'm not sure if it's technically possible for one AI to train another AI with the same algorithm and have better performance. Although I could be wrong about any and everything. :-)

          • visarga 401 days ago
            An LLM by itself could generate data, code, and iterate on its training process, thus it can create another LLM from scratch. There is a path to improve LLMs without organic text: connect them to real systems and allow them feedback. They can learn from feedback on their actions. It could be as simple as a Python execution environment, a game, a simulator, other chat bots, or a more complex system like real-world tests.
          • BizarroLand 401 days ago
            I know that NVidia is using AI that is running on NVidia chips to create new chips that they then run AI on.

            All you have left to do is to AI the process of training AI, kind of like building a lathe by hand makes a so-so lathe but that so-so lathe can then be used to build a better and more accurate lathe.

            • digdugdirk 400 days ago
              I actually love this analogy. People tend to not appreciate just how precise modern manufacturing equipment is.

              All of that modern machinery was essentially bootstrapped off a couple of relatively flat rocks. It's going to be interesting to see where this LLM stuff goes when the feedback loop is this quick and so much brainpower is focused on it.

              One of my sneaking suspicions is that Facebook/Google/Amazon/Microsoft/etc. would have been better off keeping employees on the books, if for no other reason than keeping thousands of skilled developers occupied, rather than cutting loose thousands of people during a time of rapid technological progress who now have an axe to grind.

              • scyzoryk_xyz 400 days ago
                It is a nice analogy because you can really extend it to the whole history of technological progress. Tools help make tools - all the way back to obsidian daggers and sticks.
                • qikInNdOutReply 400 days ago
                  Same goes for the belly button, that navel connecting one living being to another, back to the first mammal.
        • kindofabigdeal 401 days ago
          Doubt
      • junon 401 days ago
        I know this is a joke but electronics cause an unmeasurably small amount of heat dissipation. It's how we generate power that's the problem.
        • taneq 401 days ago
          Or what answers we ask the electronics for... "Univac, how do I increase entropy?" distant rumble of cooling fans
          • arthurcolle 401 days ago
            You mean decrease entropy?
            • taneq 400 days ago
              We'll work up to that. For now, there's insufficient data for meaningful answer.
    • nmfisher 401 days ago
      Just yesterday I was literally musing to myself "I wonder if NeRFs would help with 3D object synthesis", and here we are.

      It's definitely a fun time to be involved.

      • popinman322 401 days ago
        NeRFs are a form of inverse renderer; this paper uses Score Jacobian Chaining[0] instead. Model reconstruction from NeRFs is also an active area of research. Check out the "Model Reconstruction" section of Awesome NeRF[1].

        From the SJC paper:

        > We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as function f with parameters θ, i.e., x = f (θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on image x into a gradient on the parameter θ.

        > Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.

        Interpretation: they take multiple views, then optimize the parameters (a voxel grid in this case) of a differentiable renderer (the volume rendering function for voxels) so that its renderings reproduce those views.

        [0]: https://pals.ttic.edu/p/score-jacobian-chaining [1]: https://github.com/awesome-NeRF/awesome-NeRF
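
        A toy sketch of that loop, under heavy simplifications: PyTorch, a voxel grid "rendered" by summing density along an axis, and a plain MSE loss against fixed images standing in for the diffusion-model guidance that SJC actually uses:

            import torch

            theta = torch.zeros(16, 16, 16, requires_grad=True)  # the 3D asset: a voxel density grid

            def render(vox, axis):
                # Crude differentiable "volume rendering": integrate density along one axis.
                # Real methods ray-march with transmittance, but the key property is the same:
                # the 2D image is a differentiable function x = f(theta) of the 3D parameters.
                return vox.sum(dim=axis)

            # Placeholder per-view guidance images; in SJC the image-space gradient
            # comes from a pretrained 2D diffusion model instead.
            targets = [torch.rand(16, 16) for _ in range(3)]

            opt = torch.optim.Adam([theta], lr=0.1)
            for step in range(200):
                opt.zero_grad()
                loss = sum(((render(theta, axis) - t) ** 2).mean() for axis, t in enumerate(targets))
                loss.backward()  # gradients on the 2D images flow through dx/dtheta into the voxels
                opt.step()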

      • regegrt 401 days ago
        It's not based on the NeRF concept though, is it?

        Its outputs can provide the inputs for NeRF training, which is why they mention NeRFs. But it's not NeRF technology.

      • noduerme 401 days ago
        it's actually a really fun time to know how to sculpt in ZBrush and print out models.
        • nmfisher 401 days ago
          If I had any artistic talent whatsoever, I'd probably agree with you!
          • noduerme 401 days ago
            I won't lie... ZBrush is brutally hard. I got a subscription for work and only used it for one paid job, ever. But it's super satisfying if you just want to spend Sunday night making a clay elephant or rhinoceros, and drop $20 to have the file printed out and shipped to you by Thursday. I've fed lots of my sculpture renderings to Dali and gotten some pretty cool 2D results... but nothing nearly as cool as the little asymmetrical epoxy sculptures I can line up on the bookshelf...
    • dimatura 401 days ago
      People are definitely building at a high pace, but for what it's worth, this isn't the first work to tackle this problem, as you can see from the references. The results are impressive though!
    • noduerme 401 days ago
      yeah, the road to hell is paved with a desperate need for upvotes (and angel investment).
    • amelius 401 days ago
      Is image classification at the point yet where you can train it with one or a few examples (plus perhaps some textual explanation)?
      • f38zf5vdt 401 days ago
        Image classification is still a difficult task, especially if there are only a few examples. Training a high-resolution, 1000-class ImageNet classifier on 1M+ images from scratch is a drag involving hundreds or thousands of GPU hours. You can do low-resolution classifiers more easily, but they're less accurate.

        There are tricks to do it faster but they all involve using other vision models that themselves are trained for as long.

        • amelius 401 days ago
          But can't something like GPT help here? For example you show it a picture of a cat, then you say "this is a cat; cats are furry creatures with claws, etc." and then you show it another image and ask if it is also a cat.
          • f38zf5vdt 401 days ago
            You are humanizing token prediction. The multimodal text-vision models were all established using a scaffold of architectures that unified text-token and vision-token similarity, e.g. BLIP-2 [1]. It's possible that a model using unified representations might be able to establish that the set of visual tokens you are searching for corresponds to some set of text tokens, but only if the pretrained weights for the vision encoder are able to extract the features corresponding to the object you are describing to the vision model.

            And the pretrained vision encoder will at some point have been trained to align text and visual tokens (via cosine similarity) on some training set, so it really depends on what exactly that training set had in it.

            [1] https://arxiv.org/pdf/2301.12597.pdf
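
            For what it's worth, "show it a picture and ask if it is a cat" already looks roughly like this with an aligned text-vision encoder (a sketch assuming the Hugging Face transformers CLIP checkpoint; the image path and labels are placeholders). Whether it works hinges, as above, on what the encoder was trained to align:

                import torch
                from PIL import Image
                from transformers import CLIPModel, CLIPProcessor

                model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
                processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

                image = Image.open("mystery.jpg").convert("RGB")
                texts = ["a furry creature with claws, a cat", "a dog", "a car"]

                inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
                with torch.no_grad():
                    out = model(**inputs)
                # logits_per_image holds image-text similarity scores; softmax turns them into a guess.
                print(out.logits_per_image.softmax(dim=-1))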

          • aleph_infinity 401 days ago
            This paper https://cv.cs.columbia.edu/sachit/classviadescr/ (from the same lab as the main post, funnily) does something along those lines with GPT. It shows for things that are easy to describe like Wordle ("tiled letters, some are yellow and green") you can recognize them with zero training. For things that are harder to describe we'll probably need new approaches, but it's an interesting direction.
      • GaggiX 401 days ago
        If you have a few examples, you can use an already-trained encoder (like the CLIP image encoder) and train an SVM on the embeddings; no need to train a neural network.
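
        A minimal sketch of that recipe, assuming the Hugging Face transformers CLIP encoder and scikit-learn (file names and labels are placeholders):

            import torch
            from PIL import Image
            from sklearn.svm import SVC
            from transformers import CLIPModel, CLIPProcessor

            model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

            def embed(paths):
                # Embed images with the frozen CLIP image encoder; no network training happens here.
                images = [Image.open(p).convert("RGB") for p in paths]
                inputs = processor(images=images, return_tensors="pt")
                with torch.no_grad():
                    return model.get_image_features(**inputs).numpy()

            # A handful of labeled examples is enough to fit the SVM on the embeddings.
            train_paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
            train_labels = ["cat", "cat", "dog", "dog"]
            clf = SVC(kernel="linear").fit(embed(train_paths), train_labels)

            print(clf.predict(embed(["new_image.jpg"])))
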
    • cainxinth 399 days ago
      The engineers of the future will be poets. -Terence McKenna
  • qikInNdOutReply 401 days ago
    What happens if you build a circle? As in, this creates a 3D object from an image and another AI creates an image from the 3D object?

    https://www.youtube.com/watch?v=zPqJUrfKuqs

    Does it stabilize, or refine prejudices, or go on a fractal journey of errors over the weight landscape?

  • King-Aaron 401 days ago
    That's honestly extremely impressive. I do hope that the 'in the wild' examples aren't completely curated and are actually being rendered on the fly (they appear to be, but it's hard for me to tell if that's truly the case). Pretty cool to see, however.
    • GaggiX 401 days ago
      >and are actually being rendered on the fly

      They are precomputed: "Note that the demo allows a limited selection of rotation angles quantized by 30 degrees due to limited storage space of the hosting server." But I don't think they are curated; the seeds probably correspond to the seeds of the live demo you can host (they released the code and the models).

  • wslh 401 days ago
    I keep thinking of my project, where we take multiple photos from the same angle with moving lights to rebuild the 3D model. We are not using AI, just optics research like in [1]. We applied that to art at [2].

    [1] Methods for 3D digitization of Cultural Heritage: http://www.ipet.gr/~akoutsou/docs/M3DD.pdf

    [2] https://sublim.art
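
    The fixed-viewpoint, moving-lights setup is classic photometric stereo; a minimal sketch of the core solve (NumPy only, with placeholder light directions and images):

        import numpy as np

        # K images from a single viewpoint, each lit from a known direction.
        L = np.array([[0.0, 0.0, 1.0],
                      [0.7, 0.0, 0.7],
                      [0.0, 0.7, 0.7]])    # (K, 3) light directions
        I = np.random.rand(3, 32, 32)      # (K, H, W) observed intensities (placeholder data)

        K, H, W = I.shape
        pixels = I.reshape(K, -1)
        # Lambertian model: intensity = L @ (albedo * normal); solve per pixel by least squares.
        G, *_ = np.linalg.lstsq(L, pixels, rcond=None)   # (3, H*W)
        albedo = np.linalg.norm(G, axis=0)
        normals = (G / np.maximum(albedo, 1e-8)).reshape(3, H, W)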

    • bogwog 401 days ago
      So the business model there is: scanner + paper shredder + NFT = $$$?

      How many people have taken you up on that offer? Unless it's a shitty/low-effort painting, it seems insane to me that anyone would destroy their artwork in exchange for an NFT of that same artwork.

      • wslh 401 days ago
        What is insane for you could be completely different for others: we were at the last Miami Art Week and Art Basel, and we didn't have enough time for the number of artists who wanted to be part of the process. Will expand more later (working now), but you can see AP coverage here [1].

        It is also important to highlight that we are doing this project at our own risk, with our own money; we have built the hardware and software and are not charging artists for the process. The primary market sale is simply split 85% for artists and the rest for the project. Pretty generous in this risky market.

        [1] https://youtu.be/ajDEHSLi0iE

        • bogwog 401 days ago
          > we have been in the last Miami Art Week and Art Basel and we don't have enough time for the number of artists that wanted to be in the process. Will expand more later

          Please also include the number of those people who actually understand what an NFT is. As a native Miamian, I can guarantee you not a single one does. This city has always been a magnet for the get rich quick scheme types, and crypto is a good match for that because it's harder for a layman to grasp the scam part.

          • wslh 400 days ago
            We should start by talking about what an NFT and a POAP mean to you. What we do is a new concept where you can prove the physical object does not exist anymore and is now digital. The NFT is part of this experiment. It is an experiment for us and for the artists.
        • tough 401 days ago
          It's Banksy as a Service
    • JeremyBanks 399 days ago
      [dead]
  • echelon 401 days ago
    Super cool results.

    This is what my startup is getting into. So I'm very interested.

    These aren't "game ready" - the sculpts are pretty gross. But we're clearly getting somewhere. It's only going to keep getting better.

    I expect we'll be building all new kinds of game engines, render pipelines, and 3D animation tools shortly.

    • redox99 401 days ago
      While this is cool, this is not meant to target "game ready". For games and CGI, there's no reason to limit yourself to a single image. Photogrammetry is already extensively used, and it involves using tens or hundreds of images of the object to scan. Using many images as an input will obviously always be superior to a single one, as a single image means it has to literally make up the back side, and it has no parallax information.
      • oefrha 401 days ago
        You appear to be thinking about scanning a physical object, whereas zero-shot one image to 3D object would be vastly more useful with a single (possibly AI-generated or AI-assisted) illustration. You get a 3D model in seconds at essentially zero cost, can iterate hundreds of times in a single day.
        • redox99 401 days ago
          I agree that for stylized, painting-like 3D models it could be very cool. I was indeed thinking of the typical pipeline for a photorealistic asset.
      • digilypse 401 days ago
        What if I have a dynamically generated character description in my game’s world, generate a portrait for them using StableDiffusion and then turn that into a 3d model that can be posed and re-used?
      • flangola7 401 days ago
        This has DARPA and NSF behind it.

        They're not building this for games; they're building it for autonomous weapons.

    • bredren 401 days ago
      How do these kinds of tools complement actual 3d scanning?

      For example, Apple supposedly has put some time into 3d asset building (presumably in support of AR world building content).

      Can these inference techniques stack or otherwise help more detailed object data collection?

    • regularfry 401 days ago
      I'd be interested in zero-shot two images to 3d object. You can see how a stereo pair ought to improve the amount of information it has available.
    • nico 401 days ago
      And 3D printing. So quickly building physical tools too.
      • skybrian 401 days ago
        For printing parts, precision matters since they likely need to fit with something else. You’ll want to be able to edit dimensions on the model to get the fit right.

        So maybe someday, but I think it would have to be a project that targets CAD.

  • jonplackett 401 days ago
    Would it be useful for a robot / car trying to navigate to be able to do this?
    • eternalban 401 days ago
      Great idea. Processing latency may be an issue. It has to be fast, small, and energy efficient.
    • elif 401 days ago
      Unlikely. The front bumper of a car you are following has zero value for your ego vehicle's safety. Most of the optimization of FSD is in removing extra data to improve the latency of the mapping loop.
      • jonplackett 400 days ago
        But it seems like the main problem for self-driving is accurately understanding the world around the car. The actual driving is pretty easy. Being able to see a partial object and understand what the rest of it is is very useful to human drivers.
  • throwaway4aday 401 days ago
    If you can produce any view angle you want of an object then can't you use photogrammetry to construct a 3D object?
    • nwoli 401 days ago
      See the “Single-View 3D Reconstruction” section at the bottom where they do precisely that
  • lofatdairy 401 days ago
    This is insanely impressive, looking at the 3D reconstruction results. If I'm not mistaken, occlusions are where a lot of attention is being placed in pose estimation problems, and if there are enough annotated environmental spaces to create ground truths, you could probably add environment reconstruction to pose reconstruction. What's nice there is that if you have multiple angles of an environment from a moving camera in a video, you can treat each previous frame as a prior, which helps with prediction time and accuracy.
  • hiccuphippo 401 days ago
    Can you obtain the 3d object from this or only an image with the new perspective? This could revolutionize indie gamedev.
    • jxf 401 days ago
      You can obtain a 3D object, but it's more useful for the novel views than the object, because the object isn't very good and probably needs some processing. See the bottom of the paper.
  • mitthrowaway2 401 days ago
    > We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction.

    Here they list "GT Mesh", "Ours", "Point-E", and "MCC". Does anyone know what technique "GT mesh" refers to? Is it simply the original mesh that generated the source image?

    • GaggiX 401 days ago
      "Ground Truth", the actual mesh
    • haykmartiros 401 days ago
      Ground truth
      • EGreg 401 days ago
        Well honestly the "Ground truth" algorithm seems a lot superior to their method, it has higher fidelity in ALL the examples
        • chaboud 401 days ago
          I read that with the sarcasm that I hope was intended and had a good laugh.
        • razemio 401 days ago
          Haha, I am sorry, I spat my coffee reading this. It is ofc totally OK to not know what ground truth means, but the irony was too funny. Yes, ground truth will always be superior compared to anything else :)!
          • yorwba 401 days ago
            Ground truth will always be superior on the "does this match the ground truth?" metric, but that's often just a proxy for output quality and the model will be judged differently once deployed (e.g. "do human users like this?")

            That's something to be aware of, especially when you're using convenience data of unknown quality to evaluate your model – many research datasets scraped off the internet with little curation and labeled in a rush by low-paid workers contain a lot of SEO garbage and labeling errors.

          • EGreg 400 days ago
            I always wanted to meet the team behind Ground Truth. It’s truly remarkable what they have built. Every time AI models show up, these guys outperform them on every metric.

            Anyone have any contacts? They seem to be extremely elusive

        • sophiebits 401 days ago
          “Ground truth” doesn’t refer to a particular algorithm; it refers to the ideal benchmark of what a perfect performance would look like, which they’re grading against.
        • simlevesque 401 days ago
          Ground truth means that a human person created the model.
          • DarthNebo 401 days ago
            Not necessarily, could also be synthetic. Google did the same for hand poses in BlazePalm
        • Thorrez 401 days ago
          Ground truth means the original model that the image was generated from.
  • hypertexthero 401 days ago
    Brings to mind the Blade Runner enhance scene: https://www.youtube.com/watch?v=hHwjceFcF2Q
  • bmitc 401 days ago
    What if you give it a picture of a cardboard cutout or billboard?
  • brokensegue 401 days ago
    how is this different from the previous NeRF work? does it build a 3D model?
    • GaggiX 401 days ago
      NeRF models are trained on several views with known location and viewing direction. This model takes one image (and you don't need to train a model for each object).
      • amelius 401 days ago
        But if it takes only one image, isn't it likely to hallucinate information?
        • gs17 401 days ago
          Not just likely, it does. Try out the demo and see, e.g. what the backside of their Pikachu toy looks like. Or a little simpler, the paper has an example (the demo also has this) of the back of a car under different seeds.
      • fooker 401 days ago
        Not hotdog.
  • ilaksh 401 days ago
    I wonder if this type of thing could be adapted to a vision system for a robot? So it would locate the camera and reconstruct an entire scene from a series of images as the robot moves around.

    It probably has a ways to go, but being able to do robust SLAM etc. with just a single camera would make things much less expensive.

  • gs17 401 days ago
    For anyone else who tried to download the weights and got Google Drive throwing a quota error at you, they're working on it: https://github.com/cvlab-columbia/zero123/issues/2
  • yawnxyz 401 days ago
    Are there any models that take an image to SVG?
  • hombre_fatal 401 days ago
    Aside, I really like the UI indicators on the draggable models at the bottom that let you know you can rotate them.
  • desmond373 401 days ago
    Would it be possible to generate CAD files with this? As a base for part construction this could be game-changing.
    • gs17 401 days ago
      If you look at the example meshes, it doesn't seem very likely that it would be better than manually creating them, unless you're okay with lumpy parts that aren't exactly the right size. This is too early for it to not require a lot of cleanup to be usable.
      • flangola7 401 days ago
        In other words we just need to wait 6 more months
  • noduerme 401 days ago
    Is there some kind of symmetry at work here in the deductive process?
  • mov 401 days ago
    People plugging it as output of Midjourney in 3, 2, 1...
  • ar9av 401 days ago
    It's hard to tell for certain from the paper without going deep into the code, but it seems they created the new model the same way the depth-conditioned SD models were made, i.e. a normal finetune.

    It might be possible to create a "original view + new angle" conditioned model much more easily by taking the Controlnet/T2IAdapter/GLIDE route where you freeze the original model.
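
    The "freeze the original model, train only the new conditioning branch" pattern that route relies on looks roughly like this in generic PyTorch (a sketch, not the paper's or ControlNet's actual architecture):

        import torch
        from torch import nn

        base_unet = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))  # stand-in for the pretrained SD UNet
        adapter = nn.Linear(64, 64)  # new branch fed by the extra conditioning (e.g. original view + relative angle)

        for p in base_unet.parameters():
            p.requires_grad = False  # the original weights stay frozen, preserving their priors

        optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)  # only the adapter gets trained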

    Text-to-3D seems close to being solved.

    It also makes me think a "original character image + new pose" conditioned model would also work quite well.