SDSS optical color composite images

Using approximately 124,000 optical color composite images and spectroscopic data of nearby galaxies at 0.01 < z < 0.06 from the Sloan Digital Sky Survey (SDSS), together with follow-up emission-line measurements from the OSSY catalog, we train a Vision Transformer and demonstrate that galaxy images in JPG format alone can be used to determine whether star-forming activity dominates a galaxy, bypassing traditional spectroscopic analyses such as emission-line diagnostic diagrams.

Vision Transformer Base Architecture

Our approach employs a Vision Transformer (ViT) to analyze galaxy images for two distinct tasks: classifying galaxies as star-forming or non-star-forming, and predicting emission-line ratios.

[Figure: ViT architecture for galaxy classification and regression.]
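
As a concrete illustration, the shared backbone can be instantiated as sketched below; the 768-dimensional [CLS] token it produces feeds both task heads. The use of the timm library and the vit_base_patch16_224 checkpoint here are assumptions for illustration, not a statement of the exact training setup.

```python
# Minimal sketch of the shared ViT-Base backbone. The `timm` library and the
# `vit_base_patch16_224` checkpoint are illustrative assumptions.
import timm
import torch

# num_classes=0 removes the built-in head, so the model returns the pooled
# [CLS] token (shape: batch x 768) instead of class logits.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

images = torch.randn(8, 3, 224, 224)  # a batch of galaxy JPGs, resized/normalized
cls_tokens = backbone(images)         # (8, 768) [CLS] representations
```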

Star-Formation Classification

For classifying galaxies into star-forming and non-star-forming categories, we implement a classification head that operates on the 768-dimensional [CLS] token representation from the Vision Transformer. The head consists of a single linear transformation that projects this representation to two output logits, one per class. The model is trained with cross-entropy loss, and during inference a softmax function transforms the logits into class probabilities, where the probability assigned to the star-forming class indicates the model's confidence that the galaxy is star-forming.
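
The sketch below shows this head under the assumptions stated above (a PyTorch implementation; the class name, and the convention that index 1 denotes the star-forming class, are ours for illustration):

```python
# Classification head: a single linear layer maps the 768-dim [CLS] token
# to two logits (non-star-forming vs. star-forming).
import torch
import torch.nn as nn

class SFClassificationHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return self.fc(cls_token)  # raw logits; softmax is applied at inference

head = SFClassificationHead()
logits = head(torch.randn(8, 768))                        # (8, 2)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # training loss
probs = torch.softmax(logits, dim=-1)  # probs[:, 1]: assumed P(star-forming)
```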

Emission-Line Ratios

The regression variant, designed to predict BPT diagnostic line ratios ([NII]λ6583/Hα and [OIII]λ5007/Hβ), keeps the same base model but is modified for continuous output values. The regression head processes the [CLS] token through an initial dimensionality reduction from 768 to 384 features, followed by GELU activation. A dropout layer (p = 0.1) is employed for regularization. The final linear layer outputs two values, corresponding to the two line ratios. This head is expressive enough to capture the relationship between visual features and spectroscopic properties, while the dimensional reduction and dropout guard against overfitting.
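
A minimal sketch of this head is given below; whether the targets are the raw or (as is conventional on the BPT diagram) logarithmic line ratios is our assumption:

```python
# Regression head: 768 -> 384 linear reduction, GELU activation,
# dropout (p = 0.1), and a final linear layer producing the two
# BPT line ratios ([NII]/Halpha and [OIII]/Hbeta).
import torch
import torch.nn as nn

regression_head = nn.Sequential(
    nn.Linear(768, 384),  # dimensionality reduction
    nn.GELU(),            # GELU activation
    nn.Dropout(p=0.1),    # regularization
    nn.Linear(384, 2),    # two outputs, assumed log([NII]/Ha) and log([OIII]/Hb)
)

ratios = regression_head(torch.randn(8, 768))  # (8, 2) predicted line ratios
```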

Performance Validation on Unseen Redshift Ranges

To evaluate our model's ability to generalize beyond its training distribution, we tested our ViT architecture on an extrapolation dataset containing galaxies with redshifts outside the training range. While the model was trained on galaxies at 0.01 < z < 0.06, this evaluation dataset consists of galaxies in two redshift bins, 0.06 < z < 0.07 (N = 45,402) and 0.16 < z < 0.17 (N = 23,026), where the SDSS and OSSY databases provide a sufficient number of galaxy photometric images and spectroscopic measurements. This experimental design allows us to assess whether the learned features remain valid when applied to data the model has never seen.
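
A hypothetical evaluation loop over the two extrapolation bins might look as follows; `model`, the loaders, and the bin labels are illustrative names, not the actual evaluation code:

```python
# Hypothetical per-bin evaluation of classification accuracy on the
# extrapolation redshift bins; all names here are illustrative.
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# Assumed usage, with one DataLoader per redshift bin:
# for name, loader in {"0.06<z<0.07": loader_lo, "0.16<z<0.17": loader_hi}.items():
#     print(name, evaluate(model, loader))
```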