Description
Add BEiT to SMP
BEiT-3 is a general-purpose multimodal foundation model from Microsoft for vision and vision-language tasks. It uses a unified Multiway Transformer architecture, which supports both deep fusion and modality-specific encoding. It is pretrained with a single masked "language" modeling objective on images ("Imglish"), texts, and image-text pairs, so images are effectively modeled as another language. The paper reports state-of-the-art results at the time of publication on tasks including object detection, image classification, and semantic segmentation.
- Reported as ranking first on ADE20K val (semantic segmentation) at the time of publication
Papers with Code:
https://paperswithcode.com/paper/image-as-a-foreign-language-beit-pretraining
Paper:
https://arxiv.org/abs/2208.10442
HF reference implementation:
https://huggingface.co/docs/transformers/model_doc/beit
https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/beit/modeling_beit.py
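
A minimal sketch of what the linked Hugging Face reference exposes, assuming `torch` and `transformers` are installed. The tiny config values and the class count are hypothetical, chosen only so the model builds with random weights and runs on CPU without downloading a checkpoint. Note that this module implements the original BEiT, not BEiT-3, so it shows an interface an SMP encoder wrapper could start from rather than the BEiT-3 architecture itself.

```python
import torch
from transformers import BeitConfig, BeitForSemanticSegmentation

# Hypothetical tiny configuration so the sketch runs quickly on CPU;
# real checkpoints use much larger values.
config = BeitConfig(
    image_size=64,
    patch_size=16,
    hidden_size=64,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=128,
    out_indices=[1, 2, 3, 4],  # the segmentation head expects four feature maps
    num_labels=3,              # hypothetical number of classes
)

# Randomly initialized model: no pretrained weights are downloaded.
model = BeitForSemanticSegmentation(config)
model.eval()  # eval mode so BatchNorm layers use running statistics

# One random RGB image at the configured resolution.
pixel_values = torch.randn(1, 3, config.image_size, config.image_size)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits

# Per-class score maps with layout (batch, num_labels, height, width).
print(logits.shape)
```

The spatial size of the logits may be smaller than the input, so upsampling to the input resolution may be needed before taking a per-pixel argmax.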
Comments
As an example, please see the latest model additions: