Model card for Pix2Struct - Fine-tuned on AI2D (scientific diagram VQA)
Table of Contents
TL;DR
Using the model
Contribution
Citation
TL;DR
Pix2Struct is an…
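A minimal inference sketch with the transformers library. The checkpoint name `google/pix2struct-ai2d-base` is assumed from the published Pix2Struct AI2D fine-tunes, and the blank image is a stand-in for a real diagram:

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Checkpoint name assumed from the published Pix2Struct AI2D fine-tunes.
ckpt = "google/pix2struct-ai2d-base"
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

# Stand-in for a real scientific diagram; load your own image here.
image = Image.new("RGB", (640, 480), "white")
question = "What does the arrow labeled A point to?"

# For VQA-style Pix2Struct checkpoints the question is rendered into the
# image as a text header, so it is passed via the processor's `text` arg.
inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

On the blank stand-in image the decoded answer is meaningless; the point is the call pattern, not the output.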
Model Details: DPT-Hybrid (also known as MiDaS 3.0)
A Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation.
It was introduced in the paper Vision…
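A short usage sketch via the transformers depth-estimation pipeline, assuming the `Intel/dpt-hybrid-midas` Hub checkpoint for this model:

```python
from PIL import Image
from transformers import pipeline

# "Intel/dpt-hybrid-midas" is the DPT-Hybrid (MiDaS 3.0) checkpoint on the Hub.
depth = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

# Stand-in image; use a real photo for a meaningful depth map.
image = Image.new("RGB", (384, 384), "gray")
result = depth(image)

# The pipeline returns the raw tensor plus the predicted depth map
# rendered as a PIL image.
print(result["depth"].size)
```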
OpenVoice
Features
How to Use
Links
OpenVoice
OpenVoice is a versatile instant voice-cloning approach that requires only a short audio clip from the reference speaker…
In loving memory of Simon Mark Hughes...
Introduction
HHEM is an open-source model created by Vectara for detecting hallucinations in LLM outputs. It is particularly useful…
LSTP-Chat: Language-guided Spatial-Temporal Prompt Learning for Video Chat
Available Models:
LSTP-Chat-7B (Vicuna-7b)
For more details, please refer to our official repository
Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper ViLT: Vision-and-Language Transformer
Without Convolution or Region Supervision by Kim et…
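A minimal VQA sketch using the authors' `dandelin/vilt-b32-finetuned-vqa` checkpoint; the blank image is a placeholder for a real photo:

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# VQAv2 fine-tuned ViLT checkpoint from the paper's authors.
ckpt = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(ckpt)
model = ViltForQuestionAnswering.from_pretrained(ckpt)

image = Image.new("RGB", (384, 384), "white")  # stand-in for a real photo
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits

# ViLT treats VQAv2 as classification over a fixed set of frequent answers,
# so the prediction is a label lookup rather than generated text.
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```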
Depth Anything (small-sized model, Transformers version)
Depth Anything model. It was introduced in the paper Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang et al.…
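An inference sketch with the Transformers-compatible small checkpoint (id `LiheYoung/depth-anything-small-hf` assumed from the Hub conversion); the upsampling step restores the raw prediction to the input resolution:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

# Transformers-compatible small Depth Anything checkpoint (id assumed).
ckpt = "LiheYoung/depth-anything-small-hf"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForDepthEstimation.from_pretrained(ckpt)

image = Image.new("RGB", (518, 518), "gray")  # stand-in for a real photo

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth  # shape (batch, H, W)

# Upsample the raw prediction back to the input resolution.
depth = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],  # PIL size is (W, H); interpolate wants (H, W)
    mode="bicubic",
    align_corners=False,
).squeeze()
print(depth.shape)
```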
WhisperSpeech
Progress update [2024-01-18]
Progress update [2024-01-10]
Progress update [2023-12-10]
Downloads
Roadmap
Architecture
Whisper for modeling semantic tokens
EnCodec for modeling acoustic tokens
Appreciation
Consulting
Citations
…
