adept/fuyu-8b · Hugging Face (2024)

We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on Hugging Face. We think Fuyu-8B is exciting because:

  1. It has a much simpler architecture and training procedure than other multi-modal models, which makes it easier to understand, scale, and deploy.
  2. It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
  3. It’s fast - we can get responses for large images in less than 100 milliseconds.
  4. Despite being optimized for our use-case, it performs well at standard image understanding benchmarks such as visual question-answering and natural-image-captioning.

Please note that the model we have released is a base model. We expect that you will need to fine-tune it for specific use cases such as verbose captioning or multimodal chat. In our experience, the model responds well to few-shot prompting and fine-tuning for a variety of use cases.

Model

Fuyu-8B is a multi-modal text and image transformer trained by Adept AI.

Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the transformer decoder like an image transformer (albeit with no pooling and causal attention). See the diagram below for more details.
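As a rough illustration of the patch projection described above, the following pure-Python sketch splits a toy grayscale image into flattened patches and multiplies each one by a weight matrix. It is not Adept's implementation: the function names, the 2x2 patch size, the hidden size of 3, and the constant weights are all hypothetical values chosen to keep the example small.

```python
from typing import List

Matrix = List[List[float]]


def extract_patches(image: Matrix, patch: int) -> List[List[float]]:
    """Split a 2-D image into flattened patch vectors, left to right, top to bottom."""
    height, width = len(image), len(image[0])
    # Assumption for this sketch: the image is already a multiple of the patch size.
    assert height % patch == 0 and width % patch == 0, "pad the image first"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def project(patches: List[List[float]], weight: Matrix) -> List[List[float]]:
    """Linear projection: each patch vector (length P) times a P x D weight matrix."""
    return [
        [sum(p * w for p, w in zip(vec, col)) for col in zip(*weight)]
        for vec in patches
    ]


# Toy example: 4x4 image, 2x2 patches (P = 4), hypothetical hidden size D = 3.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
weight = [[0.1] * 3 for _ in range(4)]
embeddings = project(extract_patches(image, patch=2), weight)
print(len(embeddings), len(embeddings[0]))
```

The projected vectors play the role that token embeddings play for text: they enter the first transformer layer directly, with no embedding lookup in between.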

This simplification allows us to support arbitrary image resolutions. To accomplish this, we treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high and low-resolution training stages.
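The raster-scan layout described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the patch size is a free parameter (the value 30 in the usage line is illustrative), partial patches at the edges are assumed to be padded up to a full patch, and the placeholder strings follow the |SPEAKER| and |NEWLINE| tokens mentioned in the usage notes of this card.

```python
import math
from typing import List

IMAGE_PATCH = "|SPEAKER|"    # placeholder for one image patch embedding
IMAGE_NEWLINE = "|NEWLINE|"  # marks the end of one row of patches


def image_token_layout(height: int, width: int, patch: int) -> List[str]:
    """Return the placeholder token sequence for an image in raster-scan order."""
    # Assumption: edge patches are padded, so row/column counts round up.
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    tokens: List[str] = []
    for _ in range(rows):
        tokens.extend([IMAGE_PATCH] * cols)
        tokens.append(IMAGE_NEWLINE)  # tell the model that a row has ended
    return tokens


# A 60x90 image with an illustrative patch size of 30 gives 2 rows of 3 patches.
layout = image_token_layout(height=60, width=90, patch=30)
print(len(layout))
```

Because the sequence length simply grows with the number of patches, no image-specific position embedding or fixed resolution is needed in this scheme.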

Model Description

  • Developed by: Adept-AI
  • Model type: Decoder-only multi-modal transformer model
  • License: CC-BY-NC
  • Model Description: This is a multi-modal model that can consume images and text and produce text.
  • Resources for more information: Check out our blog post.

Evaluation

Though not the focus of this model, we did evaluate it on standard image understanding benchmarks:

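| Eval Task     | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
|---------------|---------|-------------|-------------------|---------------|--------------|------------|-------------|
| VQAv2         | 74.2    | 77.4        | 80                | 79.5          | 86.1         | 76.2       | 80.0        |
| OKVQA         | 60.6    | 63.1        | n/a               | 58.6          | 66.1         | 55.5       | 66.1        |
| COCO Captions | 141     | 138         | n/a               | n/a           | 149          | 135        | 138         |
| AI2D          | 64.5    | 73.7        | n/a               | 62.3          | 81.2         | n/a        | n/a         |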

How to Use

You can load the model and perform inference as follows:

    from transformers import FuyuProcessor, FuyuForCausalLM
    from PIL import Image
    import requests

    # load model and processor
    model_id = "adept/fuyu-8b"
    processor = FuyuProcessor.from_pretrained(model_id)
    model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

    # prepare inputs for the model
    text_prompt = "Generate a coco-style caption.\n"
    url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")

    # autoregressively generate text
    generation_output = model.generate(**inputs, max_new_tokens=7)
    generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)
    assert generation_text == ['A blue bus parked on the side of a road.']

N.B.: The token |SPEAKER| is a placeholder token for image patch embeddings, so it will show up in the model context (e.g., in the portion of generation_output representing the model context). |NEWLINE| is the "image newline" token, denoting new rows in the raster scan order input of the image patches. \x04 is the "beginning of answer" token.
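If you decode the full output rather than only the last few tokens, the "beginning of answer" token gives a simple way to separate the answer from the context. The helper below is a minimal sketch, not part of the transformers API: the function name is hypothetical, and it assumes the decoded string still contains the \x04 marker (for example, when special tokens are not stripped during decoding).

```python
BEGINNING_OF_ANSWER = "\x04"  # "beginning of answer" token described above


def extract_answer(decoded: str) -> str:
    """Return the text after the last beginning-of-answer marker.

    Falls back to the whole string when the marker is absent.
    """
    _, marker, answer = decoded.rpartition(BEGINNING_OF_ANSWER)
    return answer.strip() if marker else decoded.strip()


# Hand-written example string that mimics a decoded context plus answer.
decoded = "|SPEAKER||SPEAKER||NEWLINE|What color is the bus?\n\x04 The bus is blue.\n"
print(extract_answer(decoded))
```

This avoids hard-coding the number of generated tokens in the slice, at the cost of relying on the marker being present in the decoded text.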

Fuyu can also perform some question answering on natural images and charts/diagrams (though fine-tuning may be required for good performance):

    # question answering on a natural image
    text_prompt = "What color is the bus?\n"
    url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")

    generation_output = model.generate(**inputs, max_new_tokens=6)
    generation_text = processor.batch_decode(generation_output[:, -6:], skip_special_tokens=True)
    assert generation_text == ["The bus is blue.\n"]

    # question answering on a chart
    text_prompt = "What is the highest life expectancy at birth of male?\n"
    url = "https://huggingface.co/adept/fuyu-8b/resolve/main/chart.png"
    image = Image.open(requests.get(url, stream=True).raw)

    model_inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")

    generation_output = model.generate(**model_inputs, max_new_tokens=16)
    generation_text = processor.batch_decode(generation_output[:, -16:], skip_special_tokens=True)
    assert generation_text == ["The life expectancy at birth of males in 2018 is 80.7.\n"]

For best performance, it's recommended to end questions with \n, as shown above!
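One way to apply this recommendation consistently is to normalize prompts before passing them to the processor. The helper below is a small sketch with a hypothetical name; it only guarantees exactly one trailing newline and makes no other changes to the prompt.

```python
def format_question(question: str) -> str:
    """Ensure the prompt ends with exactly one newline character."""
    return question.rstrip("\n") + "\n"


text_prompt = format_question("What color is the bus?")
print(repr(text_prompt))
```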

Uses

Direct Use

The model is intended for research purposes only. Because this is a raw model release, we have not added further finetuning, postprocessing or sampling strategies to control for undesirable outputs. You should expect to have to fine-tune the model for your use-case.

Possible research areas and tasks include:

  • Applications in computer control or digital agents.
  • Research on multi-modal models generally.

Excluded uses are described below.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

Limitations and Bias

Limitations

  • Faces and people in general may not be generated properly.

Bias

While the capabilities of these models are impressive, they can also reinforce or exacerbate social biases.
