MolVision: Molecular Property Prediction with Vision Language Models

¹NIT, KKR  ²CRCV, University of Central Florida
*Indicates Equal Contribution

What is MolVision?

MolVision is a novel benchmark designed to study molecular property prediction by integrating skeletal structure images with SMILES representations. It evaluates Vision-Language Models (VLMs) on property prediction across diverse datasets under zero-shot, few-shot, chain-of-thought, and fine-tuning scenarios.
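To make the image-plus-SMILES pairing concrete, here is a minimal sketch of rendering a skeletal structure from a SMILES string with RDKit. The example molecule and output path are illustrative assumptions, not samples from the benchmark.

```python
# Minimal sketch: render a 2D skeletal-structure image from a SMILES string.
# Requires RDKit (pip install rdkit). The molecule below (aspirin) is an
# illustrative example, not a MolVision sample.
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CC(=O)Oc1ccccc1C(=O)O"               # aspirin
mol = Chem.MolFromSmiles(smiles)               # returns None for invalid SMILES
if mol is not None:
    img = Draw.MolToImage(mol, size=(512, 512))  # PIL image of the skeleton
    img.save("aspirin.png")
```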


Introduction

Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating molecular structure as images together with textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression, and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder to molecular images in conjunction with LoRA further improves performance.
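As a rough sketch of the LoRA setup mentioned above, the snippet below wraps a VLM with low-rank adapters via the Hugging Face peft library. The base checkpoint, rank, and target modules are our assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: attaching LoRA adapters to a VLM with Hugging Face `peft`.
# Checkpoint, rank, and target modules are assumptions, not MolVision's setup.
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
lora = LoraConfig(
    r=16,                                 # low-rank update dimension
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # only adapter weights are trainable
```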

MolVision -- Characteristics and Statistics

Characteristics of MolVision

  1. Multimodal Integration: MolVision combines skeletal structure images with SMILES representations for molecular property prediction.
  2. Diverse Datasets: It includes ten datasets covering a variety of molecular properties and complexities.
  3. Evaluation Scenarios: VLMs are assessed under zero-shot, few-shot, chain-of-thought, and fine-tuning conditions.
  4. Comparative Analysis: Benchmarking two closed-source and seven open-source VLMs to analyze their effectiveness in computational chemistry.

Statistics of MolVision

Category | Details
Number of Datasets | 10 datasets: BACE-V, BBBP-V, HIV-V, Clintox-V, Tox21-V, Esol-V, LD50-V, QM9-V, PCQM4Mv2-V, Chebi-V
Dataset Composition | Skeletal structure images paired with corresponding SMILES strings
Model Evaluation | Two closed-source and seven open-source Vision-Language Models
Performance Metrics | Measured across zero-shot, few-shot, chain-of-thought, and fine-tuning scenarios (parsing sketch below)
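Because VLM answers come back as free-form text, scoring against the metrics in the table above requires parsing them first. Below is a hedged sketch of one way to do that, assuming classification tasks are posed as Yes/No questions and regression tasks expect a single number; the heuristics are ours, not the benchmark's exact evaluation protocol.

```python
# Illustrative parsers for free-form VLM outputs; the heuristics are
# assumptions, not MolVision's exact evaluation protocol.
import re
from typing import Optional

def parse_classification(output: str) -> Optional[int]:
    """Map a Yes/No answer to 1/0; None if the answer is unparseable."""
    text = output.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    return None

def parse_regression(output: str) -> Optional[float]:
    """Extract the first numeric value (e.g., a predicted solubility)."""
    match = re.search(r"-?\d+(?:\.\d+)?", output)
    return float(match.group()) if match else None
```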

Overview of the prompt used for VLMs: We show the template prompt used for property prediction, including the general outline, task instruction, in-context learning (ICL) examples (k=2), and an image prompt. The prompt guides property prediction from both the SMILES string and the visual representation.
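A hedged sketch of how such a prompt might be assembled follows. The wording and field names are our illustration of the template described above, not the verbatim MolVision prompt; the skeletal-structure image is attached separately through the VLM's API.

```python
# Illustrative assembly of the prompt described above, with k=2 in-context
# examples. Wording and labels are assumptions, not the verbatim prompt.
def build_prompt(task_instruction, icl_examples, query_smiles):
    parts = ["You are an expert in computational chemistry.",  # general outline
             task_instruction]                                  # task instruction
    for smiles, label in icl_examples:                          # ICL examples (k=2)
        parts.append(f"SMILES: {smiles}\nAnswer: {label}")
    parts.append(f"SMILES: {query_smiles}\nAnswer:")            # query molecule
    return "\n\n".join(parts)

prompt = build_prompt(
    "Predict whether the molecule penetrates the blood-brain barrier. Answer Yes or No.",
    [("CCO", "Yes"), ("CC(=O)Oc1ccccc1C(=O)O", "No")],  # illustrative labels
    "CN1CCC[C@H]1c1cccnc1",                              # nicotine, example query
)
print(prompt)
```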


  • MolVision overview: Average performance comparison of models under zero-shot (ZS), in-context learning (ICL), chain-of-thought (CoT), and fine-tuning (FT) settings for classification (left, ↑) and regression (center, ↓) tasks. Right: impact of visual information on model performance (↑) for JanusPro.

Qualitative Result Examples


A prompt example and its prediction result for the Clintox-V dataset.


A prompt example and its prediction result for the HIV-V dataset.


A prompt example and its prediction result for the Tox21-V dataset.

BibTeX

BibTex Code Here