Implementing Vision-Powered Chit-Chats with Robots: A GPT-4 Adventure 🤖👀

Giulio A. Abbo and Tony Belpaeme

Hey there! 👋 Imagine a world where your favourite chatbot or social robot isn’t just responding to text-based inputs but is also getting a real-time visual sneak peek into the conversation. Exciting, right? Well, we implemented just that with the help of GPT-4, and I’ll explain how you can do it too!

Check out our paper “I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots” for more details.

In this short adventure, we’ll explore how to combine a large language model with live visual input from a webcam, mix the two into an effective prompt, and summarise the conversation history to keep things fast and cheap. We’ll be creating a conversational experience that’s actually context-aware. Want to dive straight into the code and try it yourself with a webcam or a Furhat robot? Here is the repo. Ready to start? Let’s go!

🖼️ GPT-4 and Images

To start, you’ll need an OpenAI account and an API key. I know… I would have liked an open-source alternative too, but we tried IDEFICS and LLaVA without good results. So GPT-4 it is for now!

We’ll be using Python: run pip install openai opencv-python to get the libraries we need. Here are a few lines of code to get you started with GPT-4 vision.

from openai import OpenAI

client = OpenAI(api_key="YOUR-KEY-HERE")

def query(prompt) -> str:
    """Send the list of messages to GPT-4 with vision and return the reply text."""
    params = {
        "model": "gpt-4-vision-preview",
        "messages": prompt,
        "max_tokens": 200,
    }
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content
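
If you want to check that your key works before adding any images, you can fire off a quick text-only request first (the full message format is covered in the next section; here the content is just a plain string):

# Quick smoke test: a single text-only user message
print(query([{"role": "user", "content": "Say hello in one short sentence."}]))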

📜 The Prompt

The prompt that you want to send to GPT-4 has a somewhat complex structure, but this is what has worked reliably so far. Basically it’s an array of messages. As you probably already know, GPT-4 supports different kinds of messages. Here’s a quick overview.

The System Message instructs the model on how to behave. It has this structure:

def format_system(content):
    return {"role": "system", "content": content}

To add text from the user, or base64 images (more on how to load images below), you’ll want to use something like this:

def format_text(content):
    return {
        "role": "user",
        "content": [{"type": "text", "text": content}],
    }

def format_image(content):
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + content},
            }
        ],
    }

And finally, when you want to incorporate GPT-4’s responses into the prompt, this will be how:

def format_assistant(content):
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": content}],
    }

Now, to put together all the text and images you have different options. It is important to keep the elements in the right order, and the easiest way to do that is a single big list. The repo with the code of this project contains a Conversation class that does this (and other things too, more on this later).
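
To make this concrete, here is a minimal example of such a list, built with the helpers above (image_base64 stands for a base64-encoded JPEG string; we’ll see how to grab one from the webcam in a moment):

# Order matters: instructions first, then what the model sees, then the dialogue
prompt = [
    format_system("You are a friendly robot."),
    format_image(image_base64),
    format_text("Hi! What can you see?"),
]
answer = query(prompt)
prompt.append(format_assistant(answer))  # keep the reply for the next turn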

📷 Taking Pictures

In this example, we’re going to incorporate images in the prompt by taking a picture with the webcam at the beginning of each user turn. In the repo you will find how to continuously take snapshots at fixed intervals during the conversation, load a video, or use a Furhat robot as the video source. Here, we will just open the webcam, take a picture, encode it as a base64 string, close the webcam, and return the string.

import base64

import cv2

def get_image():
    """Grab one frame from the webcam and return it as a base64-encoded JPEG string."""
    vid = cv2.VideoCapture(0)
    _, frame = vid.read()
    vid.release()
    _, buffer = cv2.imencode(".jpg", frame)
    return base64.b64encode(buffer).decode("utf-8")

🪄 The System Prompt

Awesome! We have all the components ready… except one: the system prompt. We have to tell GPT-4 how to interpret the images that we send, and how to respond. This takes patience and time, many trials and a bit of prompt engineering magic. Let’s cut to the chase and have a peek at the prompt that gave us the results we liked the most.

system = (
    "You are impersonating a friendly kid. "
    "In this conversation, what you see is represented by the images. "
    "For example, the images will show you the environment you are in and possibly the person you are talking to. "
    "Try to start the conversation by saying something about the person you are talking to if there is one, based on accessories, clothes, etc. "
    "If there is no person, try to say something about the environment, but do not describe the environment! "
    "Have a nice conversation and try to be curious! "
    "It is important that you keep your answers short and to the point. "
    "DO NOT INCLUDE EMOTICONS OR SMILEYS IN YOUR ANSWERS. "
)

As you can see, we ask the model to impersonate a friendly kid. It sounds strange, but this removes most of those annoying warnings and disclaimers from GPT-4’s output. Then we tell the model that the images are what it sees, and that it would be nice to open the conversation by saying something nice about what it sees. GPT-4 will try hard to describe everything in the image, and we don’t want that; we also don’t want the model to ramble on forever, so we tell it not to. Finally, the friendly kid persona that we summoned loves putting emojis in its answers; they’re of no use to us, so we ask it to leave them out, in uppercase, just to make it extra clear and loud.

🧩 Put Everything Together

Let’s glue all of this together, shall we?
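
The script in the repo is a bit more elaborate, but a minimal sketch of the loop, assuming the query, format_* and get_image helpers and the system prompt from the previous sections all live in the same file, could look like this:

messages = [format_system(system)]

while True:
    messages.append(format_image(get_image()))    # a snapshot at the start of the turn
    messages.append(format_text(input("You: ")))  # the user's turn
    answer = query(messages)
    messages.append(format_assistant(answer))     # keep the answer in the history
    print("Robot:", answer)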

Ta-daa! A nice infinite loop and it’s done! Save this with a nice name like main.py and run it with python main.py. Fingers crossed: if everything goes well you’ll be taking pics from the webcam and having a nice chat about them. Nice, isn’t it? Have fun exploring what happens when you turn off the lights and how GPT-4 responds to the weirdest scenarios. Be sure to follow OpenAI’s terms of use and keep an eye on the bill, as sending a lot of full-resolution pictures can be pricey.

As mentioned, in the repo you can find a version that continuously captures frames from a webcam, from a video, or from a Furhat robot.
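
If you’d rather roll your own, one way to approximate the fixed-interval capture (this is a rough sketch, not the repo’s implementation) is a background thread that keeps the latest frame around for the dialogue loop to read:

import base64
import threading
import time

import cv2

latest_frame = None  # most recent frame, as a base64-encoded JPEG string

def capture_loop(interval: float = 2.0):
    """Grab a frame every `interval` seconds and store it in latest_frame."""
    global latest_frame
    vid = cv2.VideoCapture(0)
    while True:
        ok, frame = vid.read()
        if ok:
            _, buffer = cv2.imencode(".jpg", frame)
            latest_frame = base64.b64encode(buffer).decode("utf-8")
        time.sleep(interval)

threading.Thread(target=capture_loop, daemon=True).start()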

✂️ Cut the Prompt Size

You’ll quickly notice that the prompt grows with every turn, slowing down the responses and driving up the cost. No good. To solve that, we do what is usually done with text-only dialogue prompts: ask the LLM to summarise the first part of the conversation!

But we can’t summarise images and dialogue together: a picture is worth a thousand words, and our dialogue would virtually disappear in a sea of image descriptions. Remember when I told you that the Conversation class in the repo does other things too? Well, when the prompt gets too long, this class asks GPT-4 to summarise some of the images in it. It scans the message list, finds the first n consecutive images, and substitutes them with a summary. If you are interested, this paper contains more details about it.

Here is the code that we used in the Conversation class.

def get_fr_summary(self) -> Tuple[List[Message], int]:
    """Summarise the frames and return the new messages and the number of frames removed."""
    # fr_buff_size is the max number of images (frames) in the prompt
    # fr_recap is the max number of frames to summarise
    # Assuming number of frames in prompt > fr_buff_size > fr_recap

    # Find the first frame and the last frame to summarise
    first_fr = None
    i = None
    for i, m in enumerate(self._messages):
        if m.is_frame():
            if first_fr is None:
                first_fr = i

        # Include at most fr_recap frames, and stop if we see a user message
        if first_fr is not None and (m.is_user() or i - first_fr >= self.fr_recap):
            break

    # Split the messages list
    before = self._messages[:first_fr]
    to_summarise = self._messages[first_fr:i]
    after = self._messages[i:]

    # Generate the summary
    prompt = [
        SystemMessage(
            "These are frames from a video. Summarise what's happening in the video in one sentence. "
            "The frames are preceded by a context to help you summarise the video. "
            "Summarise only the frames, not the context."
            "The images can be repeating, this is normal, do not point this out in the description."
            "Respond with only the summary in one sentence. This is very important. "
            "Do not include warnings or other messages."
        ).gpt_format(),
        *[b.gpt_format() for b in before],
        *[s.gpt_format() for s in to_summarise],
    ]
    summary = self.llm.query(prompt)

    # Generate the new message list with the summary
    messages = [
        *before,
        FSummaryMessage(summary),
        *after,
    ]

    return messages, i - first_fr
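
When exactly this gets called is up to the Conversation class; a hypothetical trigger (the repo’s logic may differ) is to check the number of frames after every turn and summarise until the prompt fits the buffer again:

def maybe_summarise(self) -> None:
    """Hypothetical helper: keep summarising while the prompt holds too many frames."""
    n_frames = sum(1 for m in self._messages if m.is_frame())
    while n_frames > self.fr_buff_size:
        self._messages, removed = self.get_fr_summary()
        n_frames -= removed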

🔭 What now?

I hope this journey into combining GPT-4 with real-time visual input has sparked your curiosity! The possibilities are as vast as your imagination. Now armed with the knowledge to integrate large language models and live visual input, you can create a truly interactive and context-aware conversational experience. So, what are you waiting for? Dive into the code, explore the fascinating intersection of language and vision, and let your creativity run wild. The future of chatbots and social robots is not just text-based: it’s a dynamic fusion of words and images, and you’re at the forefront of it. We’ll keep working to improve this approach and exploring new and exciting ways to make conversational agents better. Stay tuned!