Computer Vision and the Future of Imagery in the Metaverse
Amit Jain, Co-Founder of Luma AI, joins Patrick Cozzi (Cesium) and Marc Petit (Epic Games) to discuss the future of 3D and neural rendering.
Guests
Listen
Subscribe
Watch
Read
Announcer:
Today, on Building the Open Metaverse.
Amit Jain:
Imagine if you had to learn about the complexities of a Word document's internal structure to write your document. The era of desktop publishing would never have happened.
That's how 3D is being held back at the moment.
Announcer:
Welcome to Building the Open Metaverse, where technology experts discuss how the community is building the open metaverse together, hosted by Patrick Cozzi from Cesium and Marc Petit from Epic Games.
Marc Petit:
Hello, my name is Marc Petit from Epic Games, and my co-host is Patrick Cozzi from Cesium.
Patrick, how are you today?
Patrick Cozzi:
Hey, Marc. Today I'm at the GeoBiz Conference in lovely Monterey, California, so I'm doing well; it would be hard not to.
Marc Petit:
And today, we're super happy to welcome Amit Jain, he's the founder of Luma AI, to the podcast.
Amit Jain:
It's great to be here, thanks for having me.
Marc Petit:
We're very excited. We want to understand what you do; we look at you and the technology that Luma is producing as very important for the future.
Patrick Cozzi:
Amit, we're looking forward to learning a lot today. We always love to start off the show by asking our guests about how you got started in the metaverse. You've had an amazing journey spending time at Apple, and now, as an entrepreneur, tell us about it.
Amit Jain:
I think I got started in this field with the LiDAR sensor that came out on the iPhone and iPad Pro first.
We were working on that in 2018, 2019 at Apple, and it was really cool. It was a really interesting thing to work on, honestly. To learn about how we would go about building experiences for it, to learn about what kinds of things can be possible when a device is not just capable of capturing images but also sensing the world around it, being able to see the 3D structure.
Lots of deficiencies that come with these kinds of sensors, how to work with them correctly, but that was my start, especially in this field.
From that point on, I worked in the org that builds ARKit, that builds Measure, that builds the kind of AR experiences that you have come to know from Apple, and we worked on a lot of interesting projects, some still in the future, some already released, and that was my start, especially on the metaverse side of things.
If I were to think about why that was interesting to me, I think the idea that we could recreate reality, the idea that we could be somewhere where we are not, is the most exciting thing to me.
Computers so far, they can show us somewhere; we can look at photos, we can look at videos, they can kind of show us, "Hey, this is what this place looks like." And that's fine; huge potential. Instagram now is worth what? $20, $30 billion, just inside Meta, just because of that.
I think what's the next step, what's really cool, is if our computers can actually take us somewhere, and I think that is really fascinating. You might've seen videos on YouTube, these older people who have never used AR, VR, whatever have you, and they put on Quest, and it just transports them. The level of surprise, the level of amazement, I think that is really, really cool, so that's my fascination with this field.
Marc Petit:
We can think whatever we want to think about the business practices of your ex-employer, but when you take something like room capture, when they put their mind to solving a very practical problem, which is understanding the room, the results are fascinating, and this room capture capability is absolutely amazing.
Amit Jain:
I think you're absolutely right.
There's the utilitarian side of things. We have talked about, again, people using digital twins for industrial use cases, that side of things, and then there's also the fun and emotional side of things.
For any technology to really take off, especially a technology that relies on the consumer side of things, you need more than industrial use cases. Many people have made efforts to make headsets for just industry use cases, and they're great. I learned from people working in the automotive industry how much they're using AR/VR for designing and then collaborating on physical product design. I think that's very cool.
But for anything to become as big as phones, for anything to become even as big as game consoles, I believe you need that emotional pull. I believe you need that side where you're like, "Why would someone put it on?"
Then if they do, what do they feel? If they do, what does the technology help them achieve? I think that's very important to think about. For AR/VR, for a long time, it has been about games. You can play games, and that was the story for the longest time, but I believe that is very short-sighted. It's really cool. Playing games in AR/VR, it's so fantastic, but I think we need to think beyond, "Okay, why does a person who is not just interested in games, who just uses their phone for most of their computing, why would they put it on?" I believe, at least for me personally, that answer is, "Yeah, I can be sitting in your room right now; we could be having this conversation face to face, and what does it take to build that technology?"
Luma is taking a few steps in that area, and that's what we are working on.
Marc Petit:
Let's talk about Luma AI. What technologies are you focusing on today, and what are the challenges that you're trying to overcome?
Amit Jain:
Our goal is to basically let computers take us somewhere. To do that, you have to capture the there, you have to bring the there into the computer, right? From my experience in my previous work and our team at this point, combined, we have an immense amount of experience working on "older technologies": photogrammetry, 3D reconstruction, light fields, reconstruction in general, and some of the newer stuff: neural reconstructions, NeRFs, that breed of stuff.
The problem we have had with older-generation stuff is that it doesn't get close enough. It doesn't get you to a place where, when someone who's not in computer science or computer vision, or even graphics, my mom or regular people, looks at it, the results are good enough.
The results aren't quite where you're like, "Oh yeah, I'm there, and I'm lost in this area." You need something that when a regular person who doesn't have a background in this area looks at it, they're like, "Yeah, this looks like a photo, or this looks like a video." That's where you want to go. And at Luma, at the crux of it, that is what we are trying to solve.
As it turns out, most of the traditional approaches, which is the bucket we put photogrammetry, traditional graphics pipelines, taking a few images, image-based rendering, those sorts of things into, are fundamentally heuristics-based approaches.
You have some rules for how you want to actually measure the scene, what features you're going to take in, how you're going to triangulate them, and how you're going to then project textures on those things. Hand-tuned, handwritten; you can change them quite a bit, so that's the good part, by the way, they're controllable. But the bad part is, most of the time when you're trying to make a very robust solution, you don't end up with something that works generally and reliably for a lot of people.
That's where ML comes into play, because this is what machine learning is very good at: taking problems that have lots and lots of parameters, lots and lots of tunables, and also parameters which we don't understand how to model yet, theoretically. We took that approach, especially after the NeRF paper came out in 2020; we started looking at that as the first time where, when you look at the results, you have to tell a person, "Hey, this is not a recorded video, this is not the video we recorded, this is the thing we generated."
For someone, especially in computer vision and computer graphics, that is the highest level of compliment, where the images you generated look so good that they look real. In gaming, that's kind of the holy grail: "How can we get the game to look as close to real life as possible?" Of course, some games go in the other direction. The technology we are working on is this ML graphics, basically. This idea of, how can we reproduce reality in such a way that it just looks so real that it can fulfill that goal of bringing reality into the computer?
For us, that currently looks like neural graphics and generative modeling.
Patrick Cozzi:
So, Amit, my girlfriend and I are actually both users of Luma. We both have it on our phone, and we use it to capture some objects, and I wanted to ask you from the user perspective if you could talk a bit about how Luma is different, whether it's how it compares to photogrammetry and how it might work with different types of surfaces, like reflective or transparent, or just how it makes it easier to use something like a phone compared to expensive LiDAR?
Amit Jain:
It's much more forgiving than photogrammetry. There are technical aspects where it is different, or actually could not be more different, but the primary thing is it's very forgiving. You can capture in lots of different kinds of lighting conditions, you don't have to be that regimented in how much overlap you have, and you don't have to worry so much about lighting changing a little bit while you're capturing.
In photogrammetry, it's imperative that a point, a physical point, looks the same from every view, so let's get into how photogrammetry works and how neural rendering works as a contrast to that.
Photogrammetry, what you're trying to do is, let's say, you have an object, and you take various photos of it with enough overlap between them; generally, the good heuristic is you need 70% overlap, but you can, of course, tune it, you know, you can get away with 10%, you can work with that. But again, a good heuristic is about 70% overlap between the images.
If you're capturing anything meaningfully large, that's a large number of images; especially if you're going into a scene, you're talking about a huge number of images that you're capturing, so that's fun. Then basically, you go and find all the matching features between them, so for the people not aware of what that means, these are hand-tuned features; one is called SIFT.
There are some newer features now, which are ML-generated features, which are also pretty cool. There are features that are more proprietary from, I believe, Magic Leap and Snap; they have some research in this area. But the point is, there are features.
You say, if I'm capturing, let's say, this notebook I have, if I'm looking at this from an angle at this point, and from an angle at this point, this ring should look basically the same, and I'm trying to triangulate this area. I shoot a ray here, and I say, "Look, all right, I have a patch of image right there, and from here, I have a patch of image right here. I'm going to say basically they look the same."
By doing this iteratively again and again and again, you end up with tens of thousands, sometimes hundreds of thousands, sometimes millions, depending on your scene, depending on the settings, millions of these features. Now imagine that notebook has these dots on it all over the place, not physically, just in the algorithm; you have those features. Those features are what you track between images; you say, "All right, I know this point. It's supposed to look white. From this image, I can see this point again, it looks white. Great, this means the camera must have moved a little bit, but I can still see the same thing," so the end goal of that process is, you find out where the cameras were when you photographed that object.
Now, you have a 3D setup, and it's pretty impressive that it works, and it works actually reliably enough; not super reliably, but reliably enough.
There are packages like RealityCapture from Epic, there's Agisoft Metashape, there's COLMAP that people use for that; that's the first step of the pipeline, the localization part of the pipeline. Once you've done that, you know where the cameras are in the world, and you can then basically start the process of figuring out the 3D structure of the scene. It's a pretty involved thing.
The whole pipeline we call structure from motion, where the motion was your camera moving in the scene, and then you try to get the structure of it. After a lot of computation, you basically get a mesh structure, a 3D volume, onto which you project the textures. In that process, when you are generating that mesh, basically, it's a sampling process. You are trying to say, "Given the points I had, how can I connect them so that I get the rough structure of the 3D object?"
When I have that rough structure of the 3D object, you take the images you had captured and project the colors from those images onto that mesh, and now I have a textured object. It actually works really well. It has been tuned for decades and decades at this point and works really well, but only in certain circumstances.
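[Editor's note: a minimal sketch of the feature-matching, pose-recovery, and triangulation steps Amit describes, using OpenCV; a real photogrammetry pipeline such as COLMAP, RealityCapture, or Metashape adds bundle adjustment over many images, dense reconstruction, meshing, and texturing. The image paths and camera intrinsics below are placeholders.]

```python
import cv2
import numpy as np

img1 = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_02.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect hand-tuned SIFT features in two overlapping images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Match descriptors and keep the confident matches (Lowe's ratio test).
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. Recover the relative camera pose from those matches
#    (K is a placeholder intrinsics matrix, assumed known).
K = np.array([[1000.0, 0, 960], [0, 1000.0, 540], [0, 0, 1]])
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, inliers = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# 4. Triangulate the matched features into sparse 3D points;
#    meshing and projecting textures onto the mesh come after this.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points3d = (pts4d[:3] / pts4d[3]).T
```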
That's one of the biggest problems with photogrammetry. It works really well when you have scenes that are diffuse or objects that are diffuse. By that, I mean that there's not much reflection.
What happens when you have reflection? Well, very basically, why does it fail?
Instead of that really rough notebook, let's say I have a very shiny iPhone right here, and I'm trying to capture that. The problem with capturing this iPhone is that, as you're seeing, as I'm moving it, it looks different from different angles. The same problem when you're trying to capture it from this angle versus this angle; it looks different. When you're trying to do the feature matching part of the structure-from-motion pipeline, features don't match, and you end up with a geometry that looks completely blown out, that looks completely hollow.
These are the two ends of the spectrum. You have something really shiny like a phone, the screen of a phone, or you have something really diffuse like paper. In the middle of that spectrum lie most of the objects in the world. As it turns out, though, most objects are a little bit closer to the shiny side. Most objects are a little bit reflective, most objects have a little bit of shine, in our built world at least, and in the real world, too, actually.
Photogrammetry tends to fail in those situations. That's where neural rendering and machine learning approaches come into play, because yes, the first step of all of these approaches is still localization. You're still running COLMAP, you're still running your Metashape or some form of structure from motion to localize the cameras first of all, but that's where the similarities end.
Now, with the actual neural reconstruction process, you can get into a lot of detail for it. It’s technically very simple, to be very honest with you. Let's talk about the simplicity of that and how it actually works.
Let's imagine you were capturing a thing that sits in the middle of that spectrum. Now, I realize I'm on a podcast, I should be using this mic, but I'm just going to use it as a prop. This is kind of in the middle of those two things. On the left, you have this notebook, super rough; on the right, you have this phone, super smooth and reflective. In the middle, you have something which is just kind of shiny, not really, whatever. Now, let's say you took 50 images; you just walked around it roughly, you didn't take too much care, you took 50 images.
The process of neural rendering, or NeRF, specifically NeRF reconstruction, is just that you keep, let's say, 10 of these images as your test set and 40 of them as your training set. This test set is your validation set, these 40 are your training set. For any ML pipeline, the three things you have are test data, training data, and the objective function. The objective function is what you're trying to achieve with the training data, and you validate it with the test data that you already have.
In our case, the training data is 40 images, the test data is 10 images. It's not exactly split like that, and there are many more details to it, but I'm just trying to paint the picture here. The objective function you're trying to minimize, or the objective function you're trying to achieve, is how photorealistic are my rendered images compared to the images I'm holding back in tests. Very easily, you give a neural network these 40 images from different viewpoints, and you tell it, "Hey, your goal is to learn how to render these 10 images that I've held out, that I'm not showing you."
Now, those 10 images are from different viewpoints compared to the 40 you already have, so the network has to learn the 3D representation, or some form of 3D representation, to be able to render those other 10 images. It's not memorizing, it's not trying to reproduce the same angles. It's trying to learn new angles. That is hard.
Basically, that’s the gist of NeRF, where you're trying to learn how to render those held-out images or new viewpoints from the existing viewpoints.
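[Editor's note: a highly simplified sketch of the optimization loop described above: hold out some views, fit a small network so that volume-rendered rays match the training pixels, and validate on the held-out views. The data-loading helper `load_posed_rays` is a placeholder, and real NeRF systems add positional encoding, view dependence, hierarchical sampling, and much more; this is not Luma's code.]

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Maps a 3D point to (r, g, b, density); real NeRFs also take view direction."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        out = self.net(x)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])   # rgb, density

def render_rays(model, rays_o, rays_d, near=0.1, far=4.0, n_samples=64):
    """Classic volume-rendering quadrature along each ray."""
    t = torch.linspace(near, far, n_samples, device=rays_o.device)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]    # (R, S, 3)
    rgb, density = model(pts)
    alpha = 1.0 - torch.exp(-density * (far - near) / n_samples)        # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                        # (R, 3)

# Placeholder loader: ~50 posed images turned into per-pixel rays and colors,
# with 10 views held out for validation (the "test set" in the description above).
train_rays, train_rgb, test_rays, test_rgb = load_posed_rays(held_out=10)

model = TinyRadianceField()
opt = torch.optim.Adam(model.parameters(), lr=5e-4)

for step in range(20_000):
    idx = torch.randint(0, train_rays.shape[0], (4096,))
    pred = render_rays(model, train_rays[idx, 0], train_rays[idx, 1])
    loss = ((pred - train_rgb[idx]) ** 2).mean()        # photometric objective
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 1000 == 0:
        # If the held-out viewpoints render well, the model has learned something
        # about the 3D structure rather than memorizing the training views.
        with torch.no_grad():
            val = ((render_rays(model, test_rays[:, 0], test_rays[:, 1])
                    - test_rgb) ** 2).mean()
        print(step, float(loss), float(val))
```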
Some people call NeRF not really reconstruction; they call it novel view synthesis, and that's fine. But that's not the name of the process, and that's not the right way to describe the process. It's a use case for the process, or it's a thing you can do with the process, but if you reason from first principles, the internal representation has to learn something about the 3D structure of the object to accomplish this goal. Otherwise, it can't.
Anyway, you train, you train, you train. Initially, you were training for about 24 hours; now it's about four minutes for us at Luma, or it depends on the kind of scene, it varies, it could vary from four to 15 minutes, whatever have you. That's about it. You keep training until you are convinced that, yeah, the images I'm rendering pretty much match the images that my camera captured, and then I'm going to stop training.
Once you have that, you have this internal representation that is your 3D object, and now you can basically look at it from whichever viewpoint, and now you have a 3D scene or a 3D object, whatever you want to call it, that looks very realistic.
That's NeRF, much more learned compared to photogrammetry, where pretty much every step of the pipeline is hand-tuned and heuristics defined. Here we just give it to a big neural network, or actually, technically, a very small neural network, and we ask it, "Learn the entire graphics pipeline or learn the entire inverse rendering pipeline. The only inputs you have are these images and the camera views, and it's your job to figure out how to match features in it, how to map colors into it, and how to actually get the right details.
We don't know, we don't care. It's the job of the network and the training and optimization process to figure out." Now, there are a lot more details there while rendering and all that sort of stuff, but we can go into that if that is necessary.
Marc Petit:
Sounds like magic to me. How dependent is the quality of the output on the quality of the input? Do you care how you take the initial pictures? Do the resolution and the lighting matter?
Amit Jain:
Garbage in, garbage out, still holds.
If you're taking it in really poor lighting or if you're taking in lighting that is really changing drastically, people are coming in and out, that sort of thing, all of those things have some mitigation, by the way, right? Way better mitigation than what photogrammetry would do.
Photogrammetry would fail in those situations, by the way. It's more forgiving than anything that has existed before, but still.
Let's say the biggest offenders can be blur. So if you're capturing with the camera and your shutter is really long, when would that happen? You're capturing in a really dark environment, and you're using your phone. Your phone would try to be helpful and give you a really long shutter, but the problem with a long shutter is you can end up with a really blurry image.
When we are trying to minimize that objective function of getting it to render photo-realistically, the thing is, if you have a lot of blurry images, it's just going to produce a blurry output because you're defining what you want in terms of the images you're showing it.
If the images you're showing it are blurry, that's what you're going to get.
But there are things you could do there, especially in an ML pipeline, that are very hard to do in a traditional pipeline. You could teach the network to model the blur, because blur is, again, an optical phenomenon, so you can teach it to undo the blur, and you can get good results, better results than what you would be able to get without it.
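[Editor's note: a toy sketch of that "model the blur" idea, not Luma's method: give each training image a small learnable blur kernel, blur the rendered patch before the photometric loss, and drop the blur at inference time so the learned scene stays sharp. `blur_logits` would be added to the optimizer alongside the network's parameters.]

```python
import torch
import torch.nn.functional as F

n_images, k = 40, 5
# One learnable k x k kernel per training image, initialized near a delta (no blur).
blur_logits = torch.zeros(n_images, k * k, requires_grad=True)
blur_logits.data[:, (k * k) // 2] = 5.0

def apply_learned_blur(rendered_patch, image_id):
    """rendered_patch: (B, 3, H, W) patch rendered by the model for one training image."""
    kernel = torch.softmax(blur_logits[image_id], dim=-1).view(1, 1, k, k)
    kernel = kernel.repeat(3, 1, 1, 1)               # same kernel for each color channel
    return F.conv2d(rendered_patch, kernel, padding=k // 2, groups=3)

# Inside the training loop, compare the *blurred* render to the blurry photo:
#   loss = ((apply_learned_blur(rendered, i) - captured_patch) ** 2).mean()
# At render time the blur is simply not applied, yielding a sharper result.
```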
The same thing happens with lighting; if you are in a situation where you have really harsh lighting that is blowing out some parts of the images, it's going to be just really bad. Doesn't matter if you're doing photogrammetry, doesn't matter if you're doing NeRF, or if you're in a really dark environment... Actually, a dark environment is very interesting; we should talk about that.
What you could do then, if you're in a more nominal setting, like your home, photogrammetry works fine; you’re going to get somewhat blown out textures, not blown out, but somewhat dull textures, things like that.
NeRF will also give you somewhat dull textures, but you could do optimization on it and try to recover better quality images and better quality outputs from it, but it's just going to be a bit more forgiving.
I think that's, in terms of just lighting, if we are talking about it. Now, in terms of capture methodology, that's very interesting. You could just use the phone, and that's what users do today.
Marc Petit:
What is the output of the process, and how editable is it, because there are use cases where you just want to view in 3D, but the reality is, sometimes you want to art direct the content or add and modify it.
Do you get a bunch of editable data at the end?
Amit Jain:
If you're looking at the text representation of a PLY file, that gives you a very simple representation: you just have coordinates for every point, and then you have indices for how the faces are made up. Still just numbers. And you have software that interprets those numbers and gives you a representation that looks visual. It's not that different when it comes to NeRF representations, just that the representations are different from what we have had before.
There's another nuance to this.
Meshes are technically explicit representations, is what they're called. What that means is you can grab a point in, say, Meshroom or Blender, whatever you're using, and you can move it. Suddenly, you have edited the mesh, which is very cool. Most of the discussion I have seen around the editability of NeRF is, "Oh, it's a neural network; how the hell do I edit it?" It's mostly because no one has actually done the work of creating editable representations out of these things.
NeRF is an implicit representation, and here's another example of an implicit representation: SDFs, right? SDFs have been around for a very long time, and there's this really incredible tool that is all about editability; it's called Dreams, from a really fantastic team, which includes Alex Evans.
Dreams is one of the best experiences ever for editing in 3D, and you're just doing your CSG operations on SDFs. It's also a bag of numbers, you're basically editing a function, but you have all the editability, and now there are some tools like Womp that try to do SDF editing in the same way.
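[Editor's note: a tiny sketch of the CSG-on-SDFs style of editing mentioned above, in the spirit of Dreams but not its actual code: every shape is just a function from a point to a signed distance, and edits are function compositions rather than mesh surgery.]

```python
import numpy as np

def sphere(p, center, radius):
    return np.linalg.norm(p - center, axis=-1) - radius

def box(p, center, half_size):
    q = np.abs(p - center) - half_size
    return np.linalg.norm(np.maximum(q, 0.0), axis=-1) + np.minimum(q.max(axis=-1), 0.0)

def smooth_union(d1, d2, k=0.1):
    """Blend two shapes with a soft fillet instead of a hard boolean."""
    h = np.clip(0.5 + 0.5 * (d2 - d1) / k, 0.0, 1.0)
    return d2 * (1 - h) + d1 * h - k * h * (1 - h)

def subtract(d1, d2):
    return np.maximum(d1, -d2)

def scene(p):
    """A declarative 'edit': a blobby body with a box carved out of it."""
    body = smooth_union(sphere(p, np.array([0.0, 0.0, 0.0]), 0.5),
                        sphere(p, np.array([0.4, 0.0, 0.0]), 0.3), k=0.2)
    return subtract(body, box(p, np.array([0.0, 0.3, 0.0]), np.array([0.2, 0.2, 0.2])))
```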
In fact, Blender has sculpting, and sculpting is closer to SDF editing than it is to mesh editing. The better question to ask, in my mind, about editability, about consumability, is not whether it is a mesh or some implicit representation; it's "how capable is that representation?" How much does it allow a person to alter it, and how easily does it allow them to alter it?
We all know meshes are notorious when it comes to editing; programmatic editing especially is an unsolved problem. Let's say you want to segment a mesh, you want to separate out one thing from another thing in a mesh. It's a nightmare. How would you make it watertight again, or what goes in its place? All that stuff. Those problems exist, so I'll give you my philosophical answer to it first.
In my mind, what happened was, before we had GPUs, graphics processing units as we know them, we had graphics cards. These were very fixed-function things on which we had some acceleration structures to make rendering fast.
These acceleration structures were meshes and some shading along the way, but mostly not even a lot of shading.
Anyway, these acceleration structures took hold, and if you're familiar with the hardware lottery hypothesis, it's this: once you have built hardware for a representation, even if better representations come along later, you are going to keep using the older representation, because the hardware is accelerating it, and the benefit of hardware acceleration is so large that it's insurmountable to some extent.
For the entirety of the 2000s and 2010s, we were in this era of a lot of fixed-function hardware in all of these commodity devices, and meshes were the best way to represent it. But that is changing very quickly.
Now, you could easily write a CUDA shader, a Metal shader, or a Vulkan shader that approaches the performance of the fixed-function passes built into the hardware. Maybe not quite there, maybe not everybody can write as good a shader as the engineers at Nvidia or AMD or Intel or Apple have, but you can almost get there; that time is changing now.
My belief is we don't need to bring in the complexity of acceleration structures and make artists deal with that complexity, make artists think about watertight meshes, make artists think about simulation, make artists think about self-intersections in meshes, make artists think about multiple levels of detail, make artists think about all of these complexities, and I'm not even getting to materials yet.
Think about this, you are an artist, you're going to paint things, and I'm telling you, "No, you can't; you need to first learn about the composition of paints. You need to first learn about the chemical reactions that go into watercolors before you can actually do anything about it."
This is what we make artists do today in 3D, and this is what the tools are about. They expose all the complexity of GPU acceleration structures and make people learn them. It's time to change that.
Imagine if you had to learn about the complexities of a Word document's internal structure to write your document. The era of desktop publishing would never have happened. That's how 3D is being held back at the moment. The era of this massive mass-scale 3D, billion-person 3D, hasn't happened because we are making people learn about the internal representations of these things. It’s that hard to use.
Not to sidestep your question, coming back to the representation and, actually, what can we do?
NeRF representations today look like neural networks, but our team especially, and a lot of really good researchers outside our team in academia, have done so much work on extracting meshes from these representations, because that's how it should be. Mesh should just be an export format.
Games use meshes. Fantastic. It's an acceleration structure; you should just export it, right? Export great meshes out of it, and you can use them again. Export really low-poly meshes out of it, high-poly meshes out of it, export meshes that adapt to particular engine specifications.
It should be a compiler pass; it should not be a human who has to do that work. You could export SDFs out of NeRFs because, again, it's a field. It's a continuous representation.
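[Editor's note: a sketch of "mesh as an export format," a compiler-pass-style extraction rather than any particular product's pipeline: sample the learned density field on a grid and run marching cubes over it. `model` is the toy radiance field from the earlier sketch; the iso-level and resolution are arbitrary placeholders.]

```python
import numpy as np
import torch
import trimesh
from skimage import measure

N = 128                                     # grid resolution (placeholder)
lin = torch.linspace(-1.0, 1.0, N)
grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1)   # (N, N, N, 3)

with torch.no_grad():
    _, density = model(grid.reshape(-1, 3))             # toy model from the earlier sketch
density = density.reshape(N, N, N).cpu().numpy()

# Pick an iso-level that separates "solid" from "empty" space, then triangulate.
verts, faces, normals, _ = measure.marching_cubes(density, level=10.0)

# Map vertices from grid indices back to world coordinates and export a mesh file;
# decimation, UVs, and texture baking would follow for a game-ready asset.
verts = verts / (N - 1) * 2.0 - 1.0
trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals).export("capture.obj")
```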
Let's talk about what is a NeRF. What does it mean? Neural radiance field, right? What is a radiance field?
Radiance is this concept in graphics where it is basically, what is the light emitted at this point? That's the measure of radiance. It comes from physics; radiance is a very common quantity, and the formal definition comes from there. A field means everywhere in the universe.
So, a radiance field is if you were to measure at any point how much light is emitted. Technically that's a light field, so the integrated version of a radiance field is a light field.
For simplification's sake, you're saying, for each point in the world, how much light does it contribute, and that's the continuous representation you have. Sorry, it gets a little bit technical there, but the point is it's a very continuous representation. Then you can run algorithms on it to get whatever output you want. You can get different kinds of acceleration structures, so we have been doing a lot of work in getting just NeRF rendering to work very efficiently on any commodity device, and we are getting very close to it.
Just a couple of weeks ago, we announced full NeRF rendering working on iPhones, working on commodity laptops, Macs, whatever you have, with transparencies, translucencies, and these full photorealistic effects. That is an acceleration structure that we built that allows us to do that.
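[Editor's note: for readers who want the precise definition behind that description, the standard formulation from the original NeRF paper, not anything Luma-specific, is:]

```latex
% A radiance field maps a 3D point x and viewing direction d
% to an emitted color c and a volume density sigma:
F_\Theta : (\mathbf{x}, \mathbf{d}) \;\mapsto\; (\mathbf{c}, \sigma)

% Rendering a ray r(t) = o + t d integrates that field along the ray, weighting
% each point's color by the light that survives back to the camera:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```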
File format, in my mind, should be transparent; people should not have to think about file formats. It should be, whatever the engine is supposed to consume, you write a pipeline to it, and it exports, and that's it.
Marc Petit:
There is definitely something there, with people collaborating on their SDF modelers, and I think we looked at the connectivity between SDF models and Nanite, for example, creating a path and tools to get the performance.
Do you see a path? Do you see similar paths to get NeRF data into any sort of game engine technology?
Amit Jain:
Game engine integration can go in two ways.
One, you want to run it in the game and NeRF's a little bit further away from that.
But the second is just rendering NeRFs in Unreal Engine, in Blender, in Unity, in Octane, whatever have you, for all the different use cases that game engines have. It's very trivial. It's absolutely trivial to get it working. We got Unreal Engine rendering working with a new renderer in about two days. Now, we are just polishing it; then we'll release it in almost no time.
It's very trivial.
You have to work around the engine's idiosyncrasies, and then if you're clever, you can also get most of the engine’s interactivity to work.
What I mean by that is, if you bring in traditional 3D objects and have them interact with NeRF, that could work easily.
I can speak to Luma's pipeline, where we extract really good meshes out of these, and then you get full interactions, shadows, intersections, that sort of thing, so you get collisions, you get physics, and it's not the NeRF that is actually colliding with anything, because, well, you could, you could actually run really, really expensive shaders that compute collisions at every single point, and I don't think that's very good.
You should just have a proxy representation that is able to do all that stuff. That's actually a trivial problem. It's more a problem of integration and how to make a good experience out of it than it's a problem of technical feasibility.
Marc Petit:
We talk a lot about the limitations, but already, if you were able to go onto a movie stage and capture it, and bring it back such that you can do pickups in a future production, just this ability to reproduce a location with a high level of accuracy and visual fidelity, I mean, right there, there is huge value, especially if you can integrate it into the production pipeline down the road.
Amit Jain:
Talking about movies and content creation, I think that's a very interesting angle to look at.
If you look at the history of 3D, most of the time, people have thought about games, and most of the time, people have talked about interactive things that you could have in a browser. Now, especially with WebGL and hopefully WebGPU soon, we will have really, really good 3D experiences in the browser.
But there's this whole industry that uses 3D not for games, not for these kinds of users: the movie industry, where they spend an immense amount of effort on capturing, reconstructing, and editing with artists to create these environments that movies can be shot into. For instance, Shang-Chi; they have a really great breakdown of all the work that they did.
There's this scene where the bus is going down San Francisco, I think it's Lombard Street, but I might be wrong about the particular street. The bus is going down, and to shoot that scene... you should absolutely check out that blog post, and we should post a link to it; it's really incredible. [Editor's note: You got it, Amit!]
The scene is entirely shot in the studio; the bus is mounted on hydraulics where all the movement is being done, and everything is happening there, and now they have to composite that into San Francisco, with the bus colliding with and interacting with all the things around it.
They reconstructed the whole scene. It took them an inordinate amount of effort to actually get to a level of quality where you are convinced that this bus is actually going down in San Francisco.
Where NeRFs are super interesting is that their strength lies in photorealism; instead of doing all that work of capturing with photogrammetry, giving artists those meshes and having them produce the textures and fix the meshes, with months-long iterative cycles and a lot of process, you can actually get to a place very soon where you capture the scene and just move the camera in it virtually. And think about the possibilities for storytelling.
To me, what is really exciting, is not that just big production studios can do this cheaply. To me, what is exciting is someone with an iPhone can now do this maybe 80% of the way. Maybe not make, of course, full production value, maybe not make an Avatar movie, but someone with an iPhone can make things and re-shoot. I think that is very, very exciting.
We call this idea re-shooting. What is really fun about this is you go capture it once, then you can move the camera in it as if you were physically there again and again, and you can take different shots. You can use it for Previs, you can use it as an asset library that you have.
It's like, "Oh, I want to shoot this scene there, there, there," and composite altogether, make those building blocks. I think NeRF is a very strong building block for these kinds of things, where the goal is visual fidelity, where the goal is to make something that looks as photorealistic as possible without actually having to do the work of making photo-realistic 3D representations.
Patrick Cozzi:
Thinking about that consumer with that iPhone, do you see this as the ultimate democratization of 3D?
Amit Jain:
We wouldn't exist if that's not what I believed.
I think, finally, not just with NeRF, actually, that's worth thinking about. NeRF alone is a very interesting technology, but it's not sufficient. There are a lot more pieces that need to be built and that need to be integrated together.
We sit at this cusp now where it is possible to build tools that not 100,000 artists or a million artists can use, but I believe a billion people can use, where we can make 3D almost as easy to work with as photos and videos now. Photos and videos have really gotten to a place where anybody with a smartphone can at least make some level of edits, and that's kind of huge.
If you think about where we have come from, from a photography perspective, people used to have dark rooms. That's the age we are in with 3D at the moment; it's that procedural, it's that manual, it's that involved, and it's that expensive.
I believe not too long from now, we would be thinking about it, like, holy shit, people are creating the kind of things that would've taken a studio. One person would be able to do that kind of thing; I very truly believe that.
Marc Petit:
It's a good segue into the other vector of democratization, which is generative AI, and Luma launched Imagine 3D, which is basically text to 3D.
So, two questions: how does this relate to NeRFs, if there is any relationship; is there shared core technology there? And my other question would be, what's the ultimate potential? How far can we take that text-to-3D technology?
Amit Jain:
I feel like text-to-3D or text-to-image is the least imaginative and least interesting use case for generative technology.
I think this is the one we have working right now because we have a lot of image-text pairs on the internet; in ML, the data you have is kind of your limiting factor.
Imagine 3D. We are doing a lot of work when it comes to generative modeling, generative technologies, especially when it comes to 3D.
The goal here is not just to, okay, you type something, and you get something random out of it. I believe that's actually a really poor workflow. Most artists, most people who have... Okay, if you're not just experimenting or toying with it or playing with it, which is fine; actually, a lot of technologies are really good for playing with, and that's how they start, and that's how it should start. You just try to play with it and have fun with it, and that's fantastic.
But when you're talking about someone who is an artist or a game developer, or they're making something for product shots, whatever have you, 3D is widely used today, right? Again, I believe it can actually go much further, but anyway. When someone like this comes to it, they have something in mind they want to accomplish; they want a certain look, a certain kind of object, a certain kind of appearance, whatever have you.
Traditional tools are entirely procedural, and it's upon you to actually get there from a blank canvas or something that you started with, a template, or whatever have you.
With generative modeling, what is really fun is it lets you imagine a billion variants of it, and you can choose, "Oh, this one, probably, that's right." But that gets tiring very quickly, and so you see this new era of ControlNet and other tools, where you want to give people some level of control: "Okay, I have this thing, it's almost there," let's say on a spectrum from zero to one, it's at 0.7, or 70% there, but I want it to be a little bit this, I want it to be a little bit that. How do you express, "I want it to be a little bit this, I want it to be a little bit that?" That's what needs to get built.
Currently, ControlNet and stuff look advanced, but it's also very primitive still. You're saying, "Oh, I'm going to draw something and try to see if it can get close to it." I think that artistic work, the tools that we need to build, I think is the most fun part of generative AI. Because see, I was saying early on, NeRF has the potential or ML graphics have the potential of allowing people the liberty to not have to think about meshes, to not have to think about materials.
That is only possible if you're able to build these kinds of tools generatively. Declarative tools are what we call them within our team.
The example I gave you of Dreams, their founders have this notion called sensitive tools. I think they're the same things. Declarative tools are, they don't make you tell it how to do it, they make you tell it what to do, and they accomplish it.
Now, it should not be so coarse, where you just say "a llama riding a capybara," I don't know what that would look like, but "I want this to be brighter," and then you should be able to express it with as much precision as you want. Those are the tools that we need, and I think with generative modeling especially, the pieces are starting to get there where you could build those kinds of tools.
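[Editor's note: a minimal sketch of the kind of conditioning ControlNet adds today, using the open-source diffusers library and its public example checkpoints; this is an illustration of the still-primitive control Amit refers to, not anything Luma ships. The input image path is a placeholder.]

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn a rough layout sketch or reference photo into an edge map that constrains composition.
ref = np.array(Image.open("rough_layout.png").convert("L"))
edges = cv2.Canny(ref, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The text says *what*; the edge map says *where*: a first, crude form of
# the declarative control discussed above.
image = pipe("a llama riding a capybara, golden hour, photorealistic",
             image=edges, num_inference_steps=30).images[0]
image.save("controlled_result.png")
```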
That's the work we are doing, where the capturing is just for us; the goal is not to build the next capture app or things like that. We do capture because it is one of the most efficient ways of creating 3D. Our goal is to build that suite, that end-to-end tools, where someone can realize whatever they have in their mind, they can actually make it into 3D, and they don't have to do an apprenticeship at Pixar or DreamWorks for five years. That's our goal.
On that journey, NeRF is a very important primitive; it’s a good representation, but you need to marry it with generative modeling to solve a lot of critical problems, a lot of critical problems like dense captures, a lot of critical problems like quality issues that you have. We have this giant roadmap of how we will do that, how we'll gather data for it, all the things that we need to do, and a combination of these is actually very powerful.
We approach these problems from a very practical perspective.
It's like, "These are the problems we're seeing. How can we actually make them better? What are people trying to do with it, and how we'd actually make that better?”
Generative modeling, I believe, currently there's also this undercurrent of artists versus AI, and I think there are other connotations to it, which I'm not going to get into, like how do you get the data? There's that part of it, but I believe on the other side of this, artists think that it will replace them. That is the problem. They won't care if this was like, "Oh, this is making a better tool for me, so I can do my job better." But the current way that people have positioned generative AI is that this is going to replace artists.
No, this is the same as when we got compilers for programmers. People thought, "Oh, compilers are going to generate code; they’re going to replace me." But that didn't happen. Think about how many more coders there are now than we had in the 1980s because then they had to go learn assembly, they had to go learn Fortran, they had to go learn all the intricacies of each hardware.
It was very hard work. Not hard work in the sense of you do a lot of work, it was just an insane amount of work. Not many people could do programming back then. Now, orders of magnitude, more people can do programming; it’s a whole tradecraft.
That's what we want to make 3D, so people who are impressive artists today can make things that studios make, and people who can't make anything today, people who, "Oh, I want to make 3D," they go look at Blender tutorials, it's like, "I don’t want any part of this," they drop out.
We want to bring all those dropouts into a world where they can create really amazing things. That's the potential that generative AI and ML-based graphics have as a combination: they can give artists tools to create things that, holy shit, would've taken a whole studio, that would not even have been possible before.
I'm just repeating myself at this point, but it's very important to understand that is the goal. That is where we are going.
Patrick Cozzi:
That's a very inspiring vision; we want everyone to be 3D creators.
You spoke a little bit about only being as good as the training data, and I was wondering if you could talk just about any of the concerns around biases or maybe even some potential copyright infringement.
Amit Jain:
There has to be a middle ground. It cannot be an extreme one way or the other. It's like, "Oh, we just don't do anything here," because clearly, the technology is very powerful, and it'll have a huge impact. The answer should also not be, "Hey, permissionlessly just take anything, everything we can grab," because that's anarchy, and that's just not good either.
Speaking of Luma particularly, for us, we actually capture a lot of real-world data ourselves. Immense amount, to be honest with you. That is one of the reasons a lot of really, really impressive people have chosen to join our mission. You can have this mission all day, but to accomplish it, you need some processes in place, and one of them is data.
This is a very clean dataset for the kind of things that we are working on in this area, and our goal is photorealism, so a lot of the data that's out there is just not photorealistic, and also not very interesting to us from that perspective.
Overall, speaking as a field, I believe we have work to do, honestly. I think both sides have valid opinions, really. There has to be a spectrum here, from being able to declare that your image should not be used for training purposes, as simple as that, to going as far as saying that if it is used, it should not be allowed to copy my style, or people should not be able to just say, "this, in the style of this." I think we should be listening to that.
Imagine 3D currently does not try to do anything there, Imagine 3D is just an experiment from us of learning 3D modeling, learning the journey of 3D, that sort of thing.
That's what we say up there as well, and we have not deployed it widely for some of these same concerns. It's because we have to figure out how to actually do this. First of all, just technically, how do you produce really high-quality results; text-to-3D is just not there yet.
Second of all, how do you do it in such a way where it works for people, it works for artists, it works for the whole ecosystem of it, so if you were expecting an answer for me, as you know, I don't have one.
Patrick Cozzi:
Wanted to switch gears a little bit. Marc and I do a fair amount of work with the Metaverse Standards Forum, trying to support and promote an interoperable 3D ecosystem.
We want to ask your crystal ball, how far do you think we are from adding NeRF primitives to glTF, USD, or 3D Tiles?
Amit Jain:
I have two opinions there, and they're actually contradictory.
The first one I have is that it's too early. We should not standardize this yet. If we do, we are going to end up in a local minimum at best. We should let the field play out. Just look at the velocity of change in this area. You don't try to freeze velocity like this into a standardized format while this is happening. That's the first opinion I have.
Then the second one is that we do need a way to interoperate; we do need a way for engines to be able to talk to each other, so that a thing you produce here, you can take there, and then combine and change tools, because that is basically how all creative work is done, so there's that.
These two are contradictory to each other.
From Luma's perspective, what we are doing, and we haven't quite announced it yet, is we are going to basically make that easier; not as a Luma file or whatever have you, we are going to make it easier for people to bring it anywhere and integrate with it. It may still be suboptimal compared to the USD or glTF standards, but I think at this moment, the emphasis should be on how we can actually just advance the medium as much as possible.
Standardization just runs counter to it at the moment, and that's the sad truth of it. But that doesn't mean we should not be thinking about standardization because it is possible that you could end up in an era of just 100 "splintered standards," and that will be very poor as well.
One thing really good about neural networks is that they are just a bag of numbers. You could have arbitrary size neural networks, you can have different widths, and different things in there.
A standard that mandates the architecture of the neural network, I think, is a big problem.
If someone tries to do that, where they say, "No, no, no, this has to literally be this many layers, this is the width of the layer, and it has to be a UNet, or it has to be an MLP, or it has to be this," that will be a really problematic place to be.
We could come up with an architecture where, "Hey, you could put whatever the hell you want in there, and we have an engine that is going to be able to run inference by taking the model file," and that is an approach that Apple has taken, TensorFlow has taken, and others.
For instance, on Android or on iPhones, you have neural engines. Apple calls it the Apple Neural Engine, ANE. Google calls it Tensor cores, same thing. By the way, mind you, it would be very efficient, very, very good, to bake the entire architecture of the neural network, the weights, down into the silicon, and then it would be virtually free to do inference; you could do it at 100 FPS if you wanted.
That's the problem, you don't want that.
The architectures are evolving. What they try to do is like, all right, we're going to build these engines. They're just really, really good matrix multipliers. GPUs are good matrix multipliers, but you can then go two steps further and say, "Look, I'm going to drop all rendering hardware, I'm going to make it go back into the era of graphics cards, fixed-function." So, these are not quite fixed-function, but they're really, really, really good linear operators and matrix multipliers.
You can run a lot of networks on these devices very efficiently.
Not as efficiently as if it were baked into the hardware, but quite well. We should learn from that, and we should take that approach, where it can be a standardized file format that can contain any architecture, and we don't write codec-level things like, "Oh shit, no, the format has to be standard: the entire data layout, this is the header, this is the structure, these many bytes," all that stuff. We don't do that yet. Maybe in the future, but we don't do that yet. We do, "Okay, you give us the neural network; if it's inefficient, it's your problem. If the file coming from your source is bad, then the customer should blame you."
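[Editor's note: one existing example of this "standardize the container, not the architecture" pattern is the ONNX route, shown here as an analogy rather than a proposal from the conversation: any network topology goes into one file format, and a generic runtime executes it without knowing the architecture in advance.]

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Whatever architecture the asset's author chose; width, depth, and activations
# are entirely up to them, and the file format does not care.
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 4))

example = torch.randn(1, 3)
torch.onnx.export(net, example, "scene_network.onnx",
                  input_names=["point"], output_names=["rgba"],
                  dynamic_axes={"point": {0: "batch"}, "rgba": {0: "batch"}})

# A consumer (an engine, a viewer) only needs a generic inference runtime.
session = ort.InferenceSession("scene_network.onnx")
points = np.random.rand(4096, 3).astype(np.float32)
rgba = session.run(["rgba"], {"point": points})[0]
print(rgba.shape)   # (4096, 4); the consumer never needed to know the architecture
```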
Of course, it's not good for game engines that want to import that thing, and the game is going to come to a grinding halt, but we can build guardrails in game engines for that, right?
There are very good techniques that measure how much compute you're giving to a particular asset and then take it away if something is killing the frame; on one extreme you refuse to render it, or on the less extreme side, you say, "Oh, you get to render only once every two frames, or once every three frames, or I'm not going to give you the compute budget, too bad, I'm going to render you at lower resolution," whatever have you.
There are clever ways of doing it such that this doesn't lock the field out, and we still have some level of standardization, and then we can advance the interoperability and bring it everywhere we want to.
The solutions have to be new-age solutions, they cannot be MPEG-era solutions.
Marc Petit:
You had a comfy job at Apple; what got you to jump and create your own company and start Luma?
Amit Jain:
Even at Apple, the projects were extremely interesting, and we worked insanely hard, so it was not comfy, but I know that's not what you asked.
Marc Petit:
I’ll rephrase my question, what's the one thing that defines the Luma culture?
Amit Jain:
I think we tend to experiment a lot. That's one big thing.
I think it's ill-advised to start a company like Luma, to be honest with you. If I were a VC, I'd be like, "Holy shit, you're doing so much research; why don't you just take off-the-shelf technology, productionize it, sell, sell, sell. That's what you should do."
It's a bit ill-advised in the SaaS era, but then there are some people who are like, "Holy shit, if you're successful, if you're able to make this happen, the world will look very different."
Our partners are like that. Everyone who has chosen to join Luma is like that because they didn't join Luma because we were able to, like, "Oh look, this is our exact sales strategy; this is what we are able to do." We are getting there, and we have to get there, but we were not there even last year when we raised our seed and our Series A recently.
Both of those were raised on this idea of, "Okay, if we are successful, we will be able to basically change the field of 3D graphics, but also the field of how content is created."
I believe not just that 3D graphics are good for 3D content, but that they're also one of the most powerful ways of creating 2D content. So one value that runs through Luma is this value of experimentation. In our product planning meetings, sometimes we hit walls. It's like, "Okay, this thing is not solved. We don't know how to solve it." And sometimes the answer is, "Okay, the solution is to go solve it," and then that's what we'll do.
The second part of it is really hard work, an insane amount of work. Insane because experimentations are inherently inefficient; not all of them pan out, but you make up for it with hard work.
I've worked with some really, really incredible people in my life, and always the thread has been, you look at them and say, "How did you do this? How did you possibly do this?" The answer almost always inevitably is they work like crazy. They work really hard. Of course, over time, their intuition gets so good that they get to avoid a lot of pitfalls that a newbie in the field would run into.
I think we're getting good at that as well at Luma, where we get to avoid a lot of the pitfalls of some papers coming out and things like that. We understand what will work and what won't work to a certain extent, but it's experimentation; it's hard work.
What led me to decide to leave and start a company versus working at Apple is when you're doing this kind of stuff, the uncertainty is high. In a big company, I'm not talking specifically about Apple, I'm talking about every big company, you need everyone in the chain of command to buy into it and say yes.
One person says no in that 60-level high pyramid, and the thing is dead. That's it. Unless a person above them says... It's like in politics, right? One person has to say no for the thing to die.
With a start-up, one person has to say yes. For us, these are our partners at Matrix and Amplify, one person had to say yes, and then you get to actually try it. If you have conviction, if you believe, holy shit, this is something big, it'll take a lot of work, we can do it, that's the way to do it, so Luma is that startup.
Patrick Cozzi:
Now, that's an awesome culture. I think it's pragmatic, lets people think big, and lets people innovate.
Congrats on that.
So, Amit, to wrap things up, we love to ask for a shout-out to a person, people, or organization.
Amit Jain:
I would definitely like to give a shout-out to the research community. It's insane. We wouldn't exist today without open source research and just the incredible amount of work that is going on in this area, so specific places like Berkeley AI Research, the BAIR lab; NeRF came out of Berkeley, for instance.
The original authors on that, Matthew Tancik, Ben Mildenhall, and Jon Barron; a lot of work came out of his lab. Just incredible people.
Now on the generative AI side of Berkeley, there's this whole culture of researchers just working in the open and sharing, that sort of thing.
With Luma, we absolutely aim to do that. For early-stage startups, it's very hard to also publish and share because you're just trying to survive in the moment, but we have great plans to contribute back to that ecosystem.
Most certainly the research community, I think everything that you're seeing in AI, everything that is happening, is happening because a lot of really, really brilliant people did their life's work and decided to share it, and I think that's very cool.
Marc Petit:
Thank you very much, Amit. It's a perfect way to conclude this podcast.
Thank you very much for sharing your enthusiasm, your passion, and your vision for these new technologies. We wish you the best of luck with Luma AI.
Thank you, Patrick.
I want to thank our audience as well. You can reach us on our website, buildingtheopenmetaverse.org. Email us at feedback@buildingtheopenmetaverse.org, and we'll be back soon with another episode. Thank you very much, everybody.