AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

ocr computer-vision artificial-intelligence text-recognition document text-detection document-analysis end-to-end-ocr multimodal scene-text-recognition multimodal-deep-learning scene-text-detection vision-language document-understanding scene-text-detection-recognition document-recognition document-intelligence documentai vision-language-transformer vision-language-model

Updated Jul 15, 2024
C++

BAAI-Agents / Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle helps agents ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

ai gcc multimodality vlm cradle computer-control lmm grounding ai-agent large-language-models llm generative-ai vision-language-model ai-agents-framework general-computer-control personoid foundation-agent

Updated Jul 15, 2024
Python

roboflow / multimodal-maestro

Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA or CogVLM. 🔥

object-detection cross-modal multimodality instance-segmentation lmm gpt-4 visual-prompting prompt-engineering vision-language-model llava segment-anything gpt-4-vision

Updated Feb 13, 2024
Python

llm-jp / awesome-japanese-llm

Overview of Japanese LLMs (日本語LLMまとめ)

japanese generative-model japanese-language language-models language-model generative-models multimodal vision-and-language vision-language foundation-models large-language-models llm llms generative-ai large-language-model vision-language-model japanese-llm japanese-language-model llm-japanese

Updated Jul 11, 2024
TypeScript

PKU-YuanGroup / Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

video-understanding image-understanding large-language-models vision-language-model

Updated Jul 9, 2024
Python

mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

vision-and-language lmm foundation-models vision-language-model llm-agent

Updated Jun 2, 2024
Python

SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

machine-learning deep-learning vision-and-language vision-language vision-transformer vision-language-model

Updated Mar 4, 2024
Jupyter Notebook

AlaaLab / InstructCV

[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"

generative-model text-to-image multi-task-learning diffusion-models stable-diffusion vision-language-model

Updated Apr 27, 2024
Python

FoundationVision / Groma

[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

llama multimodal grounding foundation-models large-language-models llm mllm vision-language-model llama2

Updated Jun 7, 2024
Python

huangwl18 / VoxPoser

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

robotics motion-planning robotic-manipulation embodied-ai foundation-models large-language-models vision-language-model

Updated May 8, 2024
Python

OpenGVLab / Multi-Modality-Arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

chat chatbot vqa gradio multi-modality large-language-models llms chatgpt vision-language-model

Updated Apr 21, 2024
Python

PJLab-ADG / awesome-knowledge-driven-AD

A curated list of awesome knowledge-driven autonomous driving (continually updated)

autonomous-driving knowledge-driven large-language-models vision-language-model

Updated Jun 7, 2024

vision-language-model

Here are 151 public repositories matching this topic...

haotian-liu / LLaVA

QwenLM / Qwen-VL

OpenGVLab / InternVL

dvlab-research / MGM

InternLM / InternLM-XComposer

jingyi0000 / VLM_survey

deepseek-ai / DeepSeek-VL

NVlabs / prismer

AlibabaResearch / AdvancedLiterateMachinery

BAAI-Agents / Cradle

roboflow / multimodal-maestro

llm-jp / awesome-japanese-llm

PKU-YuanGroup / Chat-UniVi

mbzuai-oryx / groundingLMM

SunzeY / AlphaCLIP

AlaaLab / InstructCV

FoundationVision / Groma

huangwl18 / VoxPoser

OpenGVLab / Multi-Modality-Arena

PJLab-ADG / awesome-knowledge-driven-AD
