Note
This README was created with AI help to give you clear setup and usage instructions for sam-3d-objects.
This repository is based on the original code at https://github.com/facebookresearch/sam-3d-objects and the paper https://ai.meta.com/research/publications/sam-3d-3dfy-anything-in-images/.
The original README file has been renamed to _README.md.
To check the full commit history, see https://github.com/PARKCHEOLHEE-lab/sam-3d/commits/main/.
To set up the SAM3D environment in Vessl, set the Custom Image to docker.io/cjfl2343/sam-3d:0.0.2. This image was built for this project from the Dockerfile.sam-3d file in this repository.
Since this image uses CUDA 12.1 and SAM 3D requires at least 32 GB of VRAM for multi-object inference, it is recommended to use a node such as eve-s01, space-01, or a similar node.
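If you want to try the same image outside Vessl, a minimal local run might look like this (assuming a local NVIDIA GPU and the NVIDIA Container Toolkit; this command is an illustration, not part of the Vessl setup):

```bash
docker run --gpus all -it docker.io/cjfl2343/sam-3d:0.0.2 /bin/bash
```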
To get started with sam-3d-objects, first clone this repository:

```bash
git clone https://github.com/KAIST-VML/KOCCA-SAM3D.git
cd KOCCA-SAM3D
```

This will create a folder named KOCCA-SAM3D with all necessary source code and scripts. To set up the environment and pre-trained models, run these scripts in order:
1. Install dependencies:

   ```bash
   bash setup_a.sh
   ```

   Even if you see the following messages after the dependencies are installed, you can ignore them and proceed to the next step:

   ```
   ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
   lightning 2.3.3 requires packaging<25.0,>=20.0, but you have packaging 25.0 which is incompatible.
   ipykernel 7.1.0 requires jupyter-client>=8.0.0, but you have jupyter-client 7.4.9 which is incompatible.
   vessl 0.1.199 requires google-cloud-storage<3.0.0,>=2.15.0, but you have google-cloud-storage 2.10.0 which is incompatible.
   WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
   ```
2. Request access to the model checkpoints by providing some information at https://huggingface.co/facebook/sam-3d-objects.
3. Install the pre-trained models once your request to access the model checkpoints has been accepted and your Hugging Face token has been created (a quick token check is sketched after this list):

   ```bash
   export HUGGINGFACE_TOKEN=<your_huggingface_token>
   bash setup_b.sh
   ```
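If you want to confirm the token before running setup_b.sh, a minimal check with huggingface_hub might look like this (a sketch; the repo id is the one from the request page above):

```python
from huggingface_hub import HfApi

# Sanity check: raises an error if the token cannot access the gated repo.
api = HfApi(token="<your_huggingface_token>")
print(api.model_info("facebook/sam-3d-objects").id)
```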
The main_inference.py script can generate either a single object or an entire scene from an input image using pre-trained model weights.
To generate a 3D object from a single mask, specify the image path and the index of the mask to use (`--mask_index=N`). For example, to extract the object using mask index 26:
```bash
python main_inference.py \
    --image_path=notebook/images/shutterstock_modern_colorful_Interior_2620125197/image.png \
    --mask_index=26 \
    --output_dir=output \
    --export_images=false \
    --re_alignment_mode="none"
```

To generate 3D meshes for all object masks in an input image and combine them into a scene, set `--mask_index=-1`.
Note that the sample data for inference includes pre-defined masks; setting -1 uses these masks. If you want to run inference on your own images without providing specific masks, see "Inference with Auto Masking."
```bash
python main_inference.py \
    --image_path=notebook/images/shutterstock_stylish_kidsroom_1640806567/image.png \
    --mask_index=-1 \
    --output_dir=output \
    --export_images=false \
    --save_all_objects=false \
    --re_alignment_mode="none"
```

If you do not have object masks for your input image, you can use automatic mask generation by setting `--mask_index=-2`. The original sam-3d-objects does not support auto-masking, so this repository adds a pipeline that automatically segments interior objects: it uses SAM 3 (facebook/sam3) to detect interior objects in the image before generating a 3D model for each detected object. You can control the segmentation behavior through the `--sam_prompt`, `--sam_threshold`, and `--sam_mask_threshold` parameters.
```bash
python main_inference.py \
    --image_path=notebook/images/_9gFNBQJmk9WmdYWtkwfo45/image.png \
    --mask_index=-2 \
    --output_dir=output \
    --export_images=false \
    --save_all_objects=false \
    --re_alignment_mode="none" \
    --sam_prompt="interior objects" \
    --sam_threshold=0.4 \
    --sam_mask_threshold=0.4
```

The `--sam_prompt` parameter specifies what types of objects to detect (default: "interior objects"). You can also tune `--sam_threshold` and `--sam_mask_threshold` to control segmentation sensitivity.
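The two thresholds play different roles: `--sam_threshold` filters whole detections by confidence, while `--sam_mask_threshold` binarizes each kept mask. A schematic of this interaction (illustrative only; `scores` and `mask_logits` are stand-ins for the detector outputs, not the pipeline's actual variable names):

```python
import numpy as np

def filter_detections(scores, mask_logits, sam_threshold=0.4, sam_mask_threshold=0.4):
    """Illustrative sketch of how the two thresholds could interact."""
    keep = scores >= sam_threshold                     # drop low-confidence detections
    probs = 1.0 / (1.0 + np.exp(-mask_logits[keep]))   # logits -> probabilities
    return probs >= sam_mask_threshold                 # binarize each kept mask

scores = np.array([0.9, 0.3, 0.55])                    # per-detection confidences
mask_logits = np.random.randn(3, 512, 512)             # per-detection mask logits
masks = filter_detections(scores, mask_logits)         # -> 2 binary masks
```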
The main_profile.py script benchmarks inference speed for one or more images and outputs per-mask timing statistics.
To benchmark inference performance on all images in ./notebook/images/ (approximately 230 total object masks across all images), run:
```bash
python main_profile.py \
    --images_dir=./notebook/images/ \
    --output_dir=./output/_profile/ \
    --use_inference_cache=false \
    --save_profile_summary=false \
    --wait=0 \
    --warmup=1 \
    --active=3
```

Inference was executed for approximately 230 individual objects. For each object, the schedule was set to `wait=0`, `warmup=1`, `active=3`, yielding four runs per object. Only the three active-step wall-clock times were recorded and averaged to produce the reported per-object mean. Each run was timed as the difference between two `time.perf_counter()` calls, with `torch.cuda.synchronize()` called at each iteration to ensure all GPU work had completed before reading the clock.
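The measurement loop can be pictured as the following sketch (`run_once` is a placeholder for a single-object inference call, not a function in this repository):

```python
import time

import torch

def profile_wallclock(run_once, wait=0, warmup=1, active=3):
    """Average the wall-clock time of the `active` steps only."""
    times = []
    for step in range(wait + warmup + active):
        torch.cuda.synchronize()          # finish any pending GPU work
        start = time.perf_counter()
        run_once()
        torch.cuda.synchronize()          # wait for this run's GPU work
        elapsed = time.perf_counter() - start
        if step >= wait + warmup:         # record active steps only
            times.append(elapsed)
    return sum(times) / len(times)
```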
On an NVIDIA A5000 GPU (24 GB VRAM), the mean wall-clock time per single-object inference is approximately 37.0 seconds. Excluding model-loading overhead, the runtime is expected to decrease by approximately 20%; in this benchmark the configuration was intentionally conservative, and the model was reloaded on every run.
| mask_index | active step 001 (s) | active step 002 (s) | active step 003 (s) | average (s) |
|---|---|---|---|---|
| 0_kid_box | 45.335 | 45.491 | 45.656 | 45.494 |
| 1_kid_box | 29.460 | 30.371 | 30.460 | 30.097 |
| 2_kid_box | 36.625 | 36.838 | 37.891 | 37.118 |
| ... | ... | ... | ... | ... |
| 0_shutterstock_1243680295 | 31.515 | 31.225 | 31.996 | 31.579 |
| 1_shutterstock_1243680295 | 51.539 | 51.738 | 51.409 | 51.562 |
| 2_shutterstock_1243680295 | 31.327 | 31.515 | 31.403 | 31.415 |
| ... | ... | ... | ... | ... |
| mean | 37.063 | 36.857 | 37.093 | 37.004 |
For multi-object inference, the pipeline still performs per-object inference independently and then merges the outputs into a single scene. The merging step increases memory requirements, so a GPU with at least 32 GB of VRAM is likely necessary. Because multi-object scene generation time depends on the number of object masks in the image, total runtime scales roughly linearly with the mask count (on the hardware above, roughly N × 37 seconds for N masks).
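Conceptually, the merging step can be as simple as loading each per-object GLB and adding it to one scene. A minimal sketch with trimesh (hypothetical paths; not the repository's exact code):

```python
import glob

import trimesh

scene = trimesh.Scene()
for path in sorted(glob.glob("output/3d/object_*.glb")):  # hypothetical layout
    mesh = trimesh.load(path, force="mesh")  # flatten each GLB to one mesh
    scene.add_geometry(mesh)
scene.export("output/scene.glb")             # merged scene
```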
After merging per-object results, the script can re-align the scene geometry. The main options (--re_alignment_mode) are:
- `none`: No transformation is applied (default).
- `pca`: Applies PCA to align the scene's main axes with the coordinate axes.
- `obb`: Uses an oriented bounding box (OBB) of the entire scene for alignment.
- `obb+`: For each object, computes its OBB and re-orients the object so that its bottom face is parallel to the XY-plane.
The `obb+` mode assumes the following (a sketch of the idea appears after this list):
- Each object's OBB bottom face is its true bottom.
- Object bottoms are nearly parallel to the global XY-plane.
- There are no "floating" (z-offset) objects.
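Under these assumptions, the `obb+` idea can be sketched with trimesh as follows (a hypothetical helper, not the repository's implementation):

```python
import trimesh

def realign_obb_plus(mesh: trimesh.Trimesh) -> trimesh.Trimesh:
    """Move a mesh into its OBB frame and rest its bottom face on the XY-plane."""
    # oriented_bounds returns (to_origin, extents): a transform that centers
    # the OBB at the origin and axis-aligns it, plus the OBB edge lengths.
    # Note: the recovered OBB orientation may differ between executions.
    to_origin, extents = trimesh.bounds.oriented_bounds(mesh)
    out = mesh.copy()
    out.apply_transform(to_origin)
    out.apply_translation([0, 0, extents[2] / 2.0])  # bottom face at z=0
    return out
```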
```bash
python main_inference.py \
    --image_path=notebook/images/_R022/image.png \
    --mask_index=-2 \
    --output_dir=output \
    --re_alignment_mode="obb+" \
    --sam_prompt="furniture" \
    --sam_threshold=0.27
```

```bash
python main_inference.py \
    --image_path=notebook/images/_R022/image-iso.png \
    --mask_index=-2 \
    --output_dir=output \
    --re_alignment_mode="obb+" \
    --sam_prompt="furniture" \
    --sam_threshold=0.27
```

Note that the bounding boxes computed with `trimesh.bounds.oriented_bounds` may have different local plane orientations on each execution.
The main_vsa_dataset.py script automates the generation of 3D models for both individual objects and complete room scenes from a hierarchical dataset. Processing is skipped if the isometric room view (*_IsoView1.png or *_Isoview1.png) does not exist.
```bash
python main_vsa_dataset.py \
    --dataset_dir=./VSA_dataset \
    --output_dir=./VSA_output \
    --sam_prompt="furniture" \
    --sam_threshold=0.27
```

For each dataset folder, the script executes a two-stage pipeline:
1. Individual Object Processing

   The script iterates through all object subfolders and generates a single 3D model for each object using its corresponding isometric view image. Output for individual objects:

   ```
   VSA_output/
   └── R022/
       ├── Object0/
       │   ├── image.png
       │   └── object.glb
       ├── Object1/
       │   ├── image.png
       │   └── object.glb
       └── ...
   ```
2. Room Scene Processing

   The script processes the isometric room image using automatic mask generation (SAM with `--mask_index=-2`). Multiple objects are segmented and reconstructed independently before being merged into a unified scene representation. Output for the room scene:

   ```
   VSA_output/
   ├── R022/
   │   ├── Object0/
   │   ├── Object1/
   │   ├── ...
   │   └── Scene/
   │       ├── 3d/
   │       │   ├── object_000.glb            # reconstructed object
   │       │   ├── object_001.glb
   │       │   └── ...
   │       ├── mask/
   │       │   ├── mask_000.png              # automatically segmented mask
   │       │   ├── mask_001.png
   │       │   └── ...
   │       ├── image.png                     # original room image
   │       ├── scene_before_realignment.png  # merged scene before applying realignment
   │       ├── scene_obbs.png                # oriented bounding boxes of all objects
   │       └── scene.glb                     # merged scene
   ├── ...
   ```
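To inspect a generated scene, trimesh can load the merged GLB directly (the R022 path follows the example layout above; adjust it to your own output folder):

```python
import trimesh

scene = trimesh.load("VSA_output/R022/Scene/scene.glb")
print(list(scene.geometry))  # one entry per reconstructed object
scene.show()                 # opens an interactive viewer if one is available
```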