Guide: Image, Video, and Audio Chat

This guide explains how to use the reka.chat() function to chat about an image, a video, or an audio clip.

Full documentation of all the parameters of reka.chat() is given in the Python client library documentation.

Image understanding

The reka.chat() method accepts the parameters media_url and media_filename, which set the image, video, or audio clip for the current chat round. The media_url parameter should either point to a publicly accessible media file or be a full data URL with the file encoded in base64, e.g. data:image/jpeg;base64,<base64 encoding>.

For convenience, the media_filename parameter lets you point to a local file; the file will be base64-encoded and sent to the API server.

For example, let’s ask about the image of a cat below.

Image of a cat from the COCO dataset
import reka

response = reka.chat(
    "What's in the photo?",
    media_url="https://docs.reka.ai/_images/000000245576.jpg",
)

This gives a response like:

{
  "type": "model",
  "text": "An orange cat is sitting on a desk in front of a keyboard.\n\n"
}
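
For a local file, you can let the client handle the encoding by passing media_filename instead. The sketch below assumes a hypothetical local file named cat.jpg; the second variant shows the equivalent manual approach of building a base64 data URL and passing it as media_url.

import base64

import reka

# Option 1: let the client read and base64-encode the local file.
response = reka.chat(
    "What's in the photo?",
    media_filename="cat.jpg",  # hypothetical local file
)

# Option 2: build the data URL yourself and pass it as media_url.
with open("cat.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = reka.chat(
    "What's in the photo?",
    media_url=f"data:image/jpeg;base64,{encoded}",
)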

Video understanding

You can conduct a chat based on a video by setting media_url (or media_filename) to a video file. We recommend sticking to short videos of less than around 30 seconds.
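
For instance (a minimal sketch; the video URL below is a placeholder you would replace with your own publicly accessible file):

response = reka.chat(
    "Describe what happens in this video.",
    media_url="https://example.com/short-clip.mp4",  # placeholder URL
)
print(response["text"])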

Audio understanding

You can conduct a chat based on an audio clip by setting media_url (or media_filename) to an audio file. We recommend sticking to short clips of less than around 30 seconds.
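
For example, with a local recording (a sketch; recording.mp3 is a hypothetical filename):

response = reka.chat(
    "Summarize this audio clip.",
    media_filename="recording.mp3",  # hypothetical local audio file
)
print(response["text"])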

Multiple turns

You can continue the conversation from the previous section by appending the model turn to the conversation history. The media_url (or media_filename) must be included in conversation_history in the turn where the media was first introduced (here, the first turn):

response = reka.chat(
    "What is the cat doing?",
    conversation_history=[
        {
            # The turn that introduced the image must carry the media reference.
            "type": "human",
            "text": "What's in the photo?",
            "media_url": "http://images.cocodataset.org/val2017/000000245576.jpg",
        },
        {"type": "model", "text": "An orange cat is sitting on a desk in front of a keyboard.\n\n"},
    ],
)

This should return a response like:

{
  "type": "model",
  "text": "The cat appears to be sniffing the computer keyboard.\n\n"
}
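
The same pattern extends to further rounds: append the latest model turn to the history and ask the next question. A minimal sketch (the follow-up question here is illustrative):

conversation_history = [
    {
        "type": "human",
        "text": "What's in the photo?",
        "media_url": "http://images.cocodataset.org/val2017/000000245576.jpg",
    },
    {"type": "model", "text": "An orange cat is sitting on a desk in front of a keyboard.\n\n"},
    {"type": "human", "text": "What is the cat doing?"},
    {"type": "model", "text": response["text"]},  # model turn returned by the previous call
]

response = reka.chat(
    "What color is the cat?",  # illustrative follow-up question
    conversation_history=conversation_history,
)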

Advanced usage

The reka.chat() method has various parameters that can be used to refine its outputs. The example below demonstrates several of them, including seeding a final partial model turn ("1.") to prompt the model to generate a numbered list, and using stop_words to stop generation after three items. Because the generated text continues from the seeded "1.", the example prepends it when printing.

import reka

response = reka.chat(
    conversation_history=[
        {
            "type": "human",
            "text": "Write a list of questions based on this image I can use as a quiz.",
            "media_url": "http://images.cocodataset.org/val2017/000000245576.jpg",
        },
        {
            # Seed a partial model turn so the model continues the numbered list.
            "type": "model",
            "text": "Sure, here are some questions based on the image you provided:\n1.",
        },
    ],
    request_output_len=1024,  # maximum number of tokens to generate
    temperature=0.6,          # sampling temperature
    runtime_top_k=1024,       # top-k sampling
    runtime_top_p=0.95,       # nucleus (top-p) sampling
    stop_words=["4."],        # stop once the model starts a fourth item
)

# The returned text continues after the seeded "1.", so prepend it when printing.
print("1." + response["text"])

This should output something like:

1. What is the color of the cat in the image?
2. Where is the cat positioned in relation to the keyboard and desk?
3. Can you describe any items visible on top of the desk that might suggest this space is used for work or study?