The Shape of Intelligence

Today we are introducing speridlabs.

We are an AI lab building Spatial Intelligence: foundation models that understand the 3D world, generate within it, and eventually reason over dynamic worlds that change through time. We start with 3D because the first thing AI needs to stop doing is treating the world as flat. But 3D is not the ceiling. The world is not just geometry. The world is time, motion, interaction, memory, constraints, uncertainty, and change.

Fig. 01.A: Founding presentation

The world is not flat. Intelligence is not flat either.

Most of what we still call computer vision starts from a flat assumption: an image comes in, a prediction comes out. Detect. Segment. Track. Caption. Classify. Those are useful tasks, but they are not the real problem. The real problem is the world behind the image.

What is where. What is hidden. What changed. What stays the same when the camera moves. What would be visible from another viewpoint. What happens if an object moves. How a room, a street, a factory, a construction site, or a scene should extend beyond the pixels you were given.

Those are not image questions. They are world questions. And a model cannot answer world questions properly if it does not hold a world.

That is the bet behind speridlabs.

The missing layer

Computer vision and 3D generation are usually treated as different fields. One is about understanding what already exists. The other is about creating something new. We think that separation is wrong.

Understanding a world and generating a world depend on the same missing layer: a coherent internal representation. A model that cannot hold space cannot reliably understand it. A model that cannot hold space cannot reliably generate it either.

The field does not need another narrow model for another narrow visual task. It needs a foundation model that treats perception, reasoning, simulation, reconstruction, editing, and generation as different ways of interacting with the same internal world.

One model. One representation. Many queries.

Why space changes the problem

Images are easy to consume. Worlds are not.

An image is a rectangle. A world has geometry, scale, surfaces, occlusion, objects, constraints, memory, and time. That is why visual AI still breaks in ways that feel obvious to humans. Humans do not see the world as disconnected frames. We carry a stable model of the space around us. We understand that objects continue to exist when they are hidden. We know that a chair is the same chair from another angle. We can move through a room and keep the room coherent in our head.

That is not a small feature. That is the core of intelligence in the physical world.

A spatial foundation model should do the same kind of thing computationally. It should build an internal representation that stays consistent across viewpoint changes, partial observations, occlusion, edits, generation, and time.

Once you have that, the interface changes. You do not need one system for reconstruction, another for segmentation, another for editing, another for navigation, another for generation, and another for simulation. You ask different questions of the same world.

Reconstruct this space. What changed here? What is behind that object? Extend this scene. Remove this wall. Generate a consistent version of this environment. Simulate what happens next.

The task changes. The underlying world remains.
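To make "different questions, same world" concrete, here is a deliberately tiny sketch. It is not our model or our representation: `World`, `Obj`, and the three query methods are hypothetical names, and objects are reduced to bounding spheres so the geometry fits in a few lines. The only point is that an edit, a change query, and an occlusion query all read from one shared state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical object: a named bounding sphere."""
    name: str
    center: tuple  # (x, y, z) in metres
    radius: float


def _hidden_by(eye, target, occluder):
    """True if the segment eye -> target passes through the occluder's sphere."""
    d = tuple(t - e for t, e in zip(target, eye))
    dd = sum(x * x for x in d)
    if dd == 0:
        return False
    to_c = tuple(c - e for c, e in zip(occluder.center, eye))
    # Parameter of the point on the segment closest to the occluder's centre.
    t = sum(a * b for a, b in zip(to_c, d)) / dd
    if not 0.0 < t < 1.0:
        return False  # the occluder is not between the eye and the target
    closest = tuple(e + t * x for e, x in zip(eye, d))
    return math.dist(closest, occluder.center) < occluder.radius


@dataclass(frozen=True)
class World:
    """Hypothetical world state: one representation, many queries."""
    objects: tuple = ()

    def remove(self, name):
        """Edit query: a new world without the named object."""
        return World(tuple(o for o in self.objects if o.name != name))

    def changed(self, other):
        """Change query: what differs between this world and another."""
        before = {o.name: o for o in self.objects}
        after = {o.name: o for o in other.objects}
        shared = before.keys() & after.keys()
        return {
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "moved": sorted(n for n in shared if before[n].center != after[n].center),
        }

    def behind(self, name, eye):
        """Occlusion query: objects whose centre is hidden by `name` as seen from `eye`."""
        occluder = next(o for o in self.objects if o.name == name)
        return sorted(
            o.name for o in self.objects
            if o.name != name and _hidden_by(eye, o.center, occluder)
        )


room = World((
    Obj("sofa", (0.0, 0.0, 3.0), 1.0),
    Obj("lamp", (0.0, 0.2, 6.0), 0.3),
    Obj("table", (4.0, 0.0, 3.0), 0.8),
))

print(room.behind("sofa", eye=(0.0, 0.0, 0.0)))  # what is behind that object?
print(room.changed(room.remove("sofa")))         # what changed here?
```

Moving the eye changes the answer to the occlusion query, but nothing about the stored world changes. A real spatial model would replace the spheres with a learned representation; the shape of the interface is what this toy is meant to show.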

3D is the start, not the limit

We start by solving 3D because it is the first concrete layer where the flat assumption breaks. A model that understands geometry, viewpoint, scale, occlusion, and spatial consistency is already operating on a different unit than pixels. It is not just reading an image. It is beginning to hold a world.

But when we say 3D, we are not saying the world is only 3D. The real world is dynamic. Objects persist, move, deform, collide, disappear, reappear, get occluded, and change state. A true world model cannot stop at static space. It has to understand space through time. In that sense, 3D already implies 4D.

And even 4D is not the full abstraction. The deeper point is not the number of dimensions. The point is that intelligence for the physical world needs to represent state. Geometry is state. Time is state. Motion is state. Semantics are state. Physical constraints are state. Memory is state. Uncertainty is state. Actions and consequences are state.

Spatial AI starts with 3D because that is the first necessary break from flat intelligence. It moves to 4D because the world is dynamic. And over time, the same idea extends to higher-dimensional representations of the problems themselves: richer world states that can be perceived, queried, edited, simulated, and generated.

Why this is hard

3D is not a single problem. It is a stack of problems that interact.

The first is representation. Language has tokens. Images have pixels. In 3D and 4D, the right representation is still not obvious. It has to be compact enough to train on, expressive enough to carry geometry and change, stable enough to reason over, editable enough to be useful, and scalable enough to become a foundation model. Representation is not an implementation detail. It decides whether the model is actually thinking in space, or only imitating space.

The second is data. The internet is full of text, images, and video. It is not full of clean, consistent, structured worlds. That changes the learning problem. You cannot just scrape the web and hope the model discovers spatial intelligence by accident. You need systems that can extract structure from real captures, generate coherent worlds, validate geometry, mix real and synthetic data, and avoid teaching the model shortcuts.

The third is evaluation. In spatial AI, looking good is not enough. A generated scene can look impressive from one angle and collapse from another. A reconstruction can look clean and still be geometrically wrong. A model can pass a 2D benchmark and still fail the moment you ask it to reason across space or time. The right benchmarks need to measure consistency, not just appearance.
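One way to picture "consistency, not just appearance" is a reprojection check across views. The sketch below is an illustration under simplifying assumptions, not our benchmark: `Camera` is a hypothetical unrotated pinhole camera, and `consistency_report` is a made-up helper. It scores a set of 3D points against what each view observed and reports the worst view, because a scene is only as consistent as the angle where it breaks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    """Hypothetical pinhole camera looking down +z, with no rotation (for brevity)."""
    position: tuple  # (x, y, z)
    focal: float     # focal length in pixels

    def project(self, point):
        x = point[0] - self.position[0]
        y = point[1] - self.position[1]
        z = point[2] - self.position[2]
        if z <= 0:
            raise ValueError("point is behind the camera")
        return (self.focal * x / z, self.focal * y / z)


def reprojection_error(camera, points, observed):
    """Mean pixel distance between projected 3D points and the view's observations."""
    total = 0.0
    for p, obs in zip(points, observed):
        u, v = camera.project(p)
        total += math.hypot(u - obs[0], v - obs[1])
    return total / len(points)


def consistency_report(cameras, points, observations):
    """Per-view errors plus the worst one."""
    errors = [
        reprojection_error(cam, points, obs)
        for cam, obs in zip(cameras, observations)
    ]
    return errors, max(errors)


# Ground truth scene and two views of it.
true_points = [(0.0, 0.0, 4.0), (1.0, 0.0, 5.0), (-1.0, 1.0, 6.0)]
cameras = [Camera((0.0, 0.0, 0.0), 500.0), Camera((1.0, 0.0, 0.0), 500.0)]
observations = [[cam.project(p) for p in true_points] for cam in cameras]

# A "flat" reconstruction: every point pushed onto the plane z = 4 along its
# ray from the first camera, so it looks identical from that one view.
flat_points = [(p[0] * 4.0 / p[2], p[1] * 4.0 / p[2], 4.0) for p in true_points]

errors, worst = consistency_report(cameras, flat_points, observations)
print(errors, worst)
```

The flat reconstruction matches the first view essentially perfectly and still fails the second, which is the case a single-view appearance metric would miss.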

The fourth is systems. Spatial outputs are heavier than text and images. They need viewers, APIs, formats, tooling, storage, rendering, editing, workflows, and ways to fit into existing production pipelines. A spatial foundation model is not just a model you serve. It is a new layer people build on.

Why now

The ingredients are finally close enough to move together.

Foundation-model training culture exists now. Generative models are strong enough to act as priors. Compute is still expensive, but serious work is possible. Demand is no longer theoretical: every industry is trying to automate physical workflows, and 2D perception hits a wall as soon as consistency across viewpoints and time matters.

But the deeper reason is this: the next interface for AI is not only text, image, or video. It is space.

Image models produce images. Video models produce motion. Spatial models produce scenes you can walk into.

That is not just better 3D. It is a new medium for design, storytelling, simulation, robotics, construction, games, infrastructure, virtual production, and any system that needs to reason about the physical world.

And it needs both sides at once. It needs imagination: the ability to generate plausible worlds, variations, scenarios, and futures. But it also needs grounding: the ability to lock onto real structure, real geometry, real constraints, and real measurements when precision matters.

In the physical world, plausible is not enough. The model has to dream and anchor.

What we are building

We are building toward foundation models that understand the world, generate within it, and expose that world as something developers, researchers, creators, and companies can use.

The first step is not a flashy demo. It is spatial consistency. The model has to learn geometry. It has to understand viewpoint. It has to reconstruct from real inputs, not only clean captures. It has to work with the messy way people actually capture the world: phones, cameras, drones, indoor scenes, outdoor scenes, blur, reflections, partial coverage, repeated textures, imperfect data.

The output should not just look good from the training view. It should stay coherent when you move.

From there, generation becomes more than text-to-3D. The point is not to generate isolated assets floating in empty space. The point is to generate worlds that extend coherently, maintain state, and can be edited, queried, simulated, and reused.

Eventually, the same foundation should support understanding, generation, simulation, planning, memory, and dynamic prediction. Not because those are separate products, but because they are separate questions asked of the same internal world.

That is the layer we think is missing.

A new unit of work

The important shift is not only technical. It changes what people build.

Text models changed the unit of language work. Image models changed the unit of visual creation. Spatial models change the unit of physical-world computation. The unit stops being a frame, an asset, a detection, or a pipeline. The unit becomes a world state.

That world state can be reconstructed from reality, generated from imagination, grounded with measurements, extended through time, queried by developers, edited by creators, and used by downstream systems. It becomes a primitive that many fields can build on.

This is why Spatial AI is bigger than one task, one product, or one demo category. It is a foundation layer for problems where the model needs to understand not only what something looks like, but where it is, how it changes, what it implies, and what could happen next.

This space is early. We want to help give it traction beyond our own work and bring in people from other fields who should care about it: researchers, engineers, creators, designers, roboticists, simulation people, game builders, infrastructure teams, and anyone trying to make AI understand the physical world. We think open source, open intelligence, and open knowledge are the path forward. The field needs shared models, shared benchmarks, shared tools, and shared language if it is going to compound.

We are backed by Pear VC and Base10.

This is spatial intelligence.