CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

1ShanghaiTech University 2Transcengram 3DeepSeek AI 4University of Hong Kong
(* denotes equal contribution, † denotes the corresponding author)

Example of Command Sequence Representation


A simple example of the construction process of a CAD model using the command sequence representation.
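For concreteness, a sketch-and-extrude command sequence for a simple part (a rectangular plate with a circular hole) could be written roughly as below. The command names and parameter layout are a hypothetical Python illustration, not the exact serialization used in the paper.

# Hypothetical sketch-and-extrude command sequence for a rectangular
# plate with a circular hole; command names and parameters are
# illustrative only, not the paper's exact tokenization.
command_sequence = [
    {"cmd": "SOL"},                                    # start the outer sketch loop
    {"cmd": "LINE", "x": 40.0, "y": 0.0},              # four lines trace a rectangle
    {"cmd": "LINE", "x": 40.0, "y": 20.0},
    {"cmd": "LINE", "x": 0.0, "y": 20.0},
    {"cmd": "LINE", "x": 0.0, "y": 0.0},
    {"cmd": "SOL"},                                    # second loop: the hole
    {"cmd": "CIRCLE", "x": 20.0, "y": 10.0, "r": 4.0},
    {"cmd": "EXTRUDE", "distance": 5.0,                # lift the sketch into a solid
     "operation": "new_body"},
    {"cmd": "EOS"},                                    # end of sequence
]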

Network Architecture


We propose a network capable of processing up to three modalities of input data simultaneously. Each non-text input is first passed through a frozen encoder, followed by a projection layer that aligns its features with the shared feature space of a large language model (LLM). By combining the prompt with the multimodal embeddings and fine-tuning the LLM with Low-Rank Adaptation (LoRA), our model generates accurate CAD models conditioned on the combined input data.
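A minimal PyTorch-style sketch of this pipeline is given below. The class and parameter names, feature dimensions, and the assumption that the LLM is already wrapped with LoRA adapters (e.g., via a library such as peft) are placeholders for illustration, not the released implementation.

import torch
import torch.nn as nn

class CADMLLMSketch(nn.Module):
    """Illustrative pipeline: frozen per-modality encoders, trainable
    projection layers into the LLM embedding space, and a causal LLM
    whose weights are adapted with LoRA (assumed already wrapped,
    e.g. with peft). All sizes and names are placeholders."""

    def __init__(self, lora_llm, encoders, encoder_dims, llm_dim=4096):
        super().__init__()
        self.llm = lora_llm                          # only its LoRA adapters are trainable
        self.encoders = nn.ModuleDict(encoders)
        for enc in self.encoders.values():           # keep modality encoders frozen
            for p in enc.parameters():
                p.requires_grad = False
        # one trainable projector per modality: encoder features -> LLM token space
        self.projectors = nn.ModuleDict({
            name: nn.Linear(encoder_dims[name], llm_dim) for name in encoders
        })

    def forward(self, prompt_embeds, modal_inputs):
        # encode each provided modality (image, point cloud, ...) and project it
        tokens = []
        for name, x in modal_inputs.items():
            with torch.no_grad():
                feats = self.encoders[name](x)       # (B, T_m, encoder_dims[name])
            tokens.append(self.projectors[name](feats))
        # prepend the projected multimodal tokens to the text prompt embeddings
        embeds = torch.cat(tokens + [prompt_embeds], dim=1)
        # the LoRA-tuned LLM autoregressively decodes the CAD command sequence
        return self.llm(inputs_embeds=embeds)

Under this setup only the projection layers and the LoRA adapters receive gradients, which keeps fine-tuning lightweight while aligning each modality with the LLM's feature space.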

Our Dataset (Omni-CAD)


Qualitative comparison: for visualization, we exclude CAD models whose IDs already appear in the DeepCAD dataset. The extended portion of our dataset contains more complex and realistic models with finer details.


Examples of the conditioning multimodal data and the corresponding ground-truth CAD models.
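As a rough illustration of how such a pairing could be organized, a single training record might bundle the conditioning modalities with the ground-truth command sequence as below; the field names and file formats are hypothetical, not the dataset's actual schema.

# Hypothetical layout of one training record pairing the conditioning
# modalities with the ground-truth CAD model; all fields are illustrative.
record = {
    "uid": "00001234",                        # placeholder identifier
    "text": "A rectangular plate with a centered circular hole.",
    "image": "renders/00001234_view0.png",    # rendered view of the model
    "point_cloud": "points/00001234.ply",     # points sampled from the surface
    "ground_truth": "commands/00001234.json"  # CAD command sequence to generate
}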

Point Conditioned Generation Results

(Please see our paper for more results under additional modality conditions.)

Qualitative comparison with point-to-B-rep reconstruction baselines. Blue lines denote dangling edges, which lead to non-manifold structures.


Our model demonstrates enhanced robustness to noise and to partial removal of the point cloud compared to the baseline.

BibTeX

@misc{xu2024CADMLLM,
  title={CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM},
  author={Jingwei Xu and Chenyu Wang and Zibo Zhao and Wen Liu and Yi Ma and Shenghua Gao},
  year={2024},
  eprint={2411.04954},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}