LumiTeh BUA

https://x.com/LumiTeh / https://www.lumiteh.com/ / https://github.com/LumiTeh-hub

Overview

Browser-Using Agents (BUA) follow the Computer-Using Agent (CUA) model popularized by OpenAI, but are extended to operate specifically in browser environments.

Traditional CUA models combine the vision and reasoning capabilities of LLMs to simulate control over computer interfaces and perform tasks. Browser-Using Agents focus solely on the browser as the primary interface, as giving AI agents access to the page DOM can significantly enhance their performance.

BUA is accessible via the bua/completions endpoint.

How it works

In addition to the Screenshot + Prompt input used by traditional CUA, BUA also takes the page's DOM, which improves its understanding of and reasoning about web content. This is illustrated in the figure below.

Send a request to `bua/completions`

Include the computer tool among the available tools, specifying its display size and environment. You can also provide a screenshot of the environment’s initial state in the first request.

Receive a response from the BUA model

The response will include a list of actions to help achieve the specified goal. These actions may involve clicking at a specific position, entering text, scrolling, or even waiting.

Execute the requested action

Execute the corresponding action in your browser environment through code.

Capture the updated state

After executing the action, capture the updated state of the environment as a screenshot.

Repeat

Send a new request with the updated state as a computer_call_output, and repeat this loop until the model stops requesting actions or you decide to stop.
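
The five steps above can be sketched as a small driver loop. This is only an illustration: `request_action`, `execute`, and `capture_state` are hypothetical callables that you would supply (a wrapper around the BUA request, your browser automation code, and your screenshot/DOM capture), and `max_steps` is an assumed safety limit rather than part of the API.

```python
from typing import Callable, Optional


def run_bua_loop(
    request_action: Callable[[dict], Optional[dict]],
    execute: Callable[[dict], None],
    capture_state: Callable[[], dict],
    max_steps: int = 20,
) -> list:
    """Run the request -> execute -> capture loop and return the executed actions."""
    history = []
    state = capture_state()  # initial screenshot + DOM
    for _ in range(max_steps):
        # Hypothetical wrapper: sends the current state, returns the next action
        # or None once the model stops requesting actions.
        action = request_action(state)
        if action is None:
            break
        execute(action)            # perform the action in the browser
        history.append(action)
        state = capture_state()    # updated state for the next request
    return history
```

Passing the three steps in as callables keeps the loop independent of any particular browser library, and the step limit gives you a way to stop even if the model keeps requesting actions.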

Setting up your environment

Before you can use BUA, you need a browser environment that can capture screenshots and DOM snapshots of a given web page.

We advise using Playwright for this purpose. You can check out the library for an example implementation, in particular:

computer.screenshot()
computer.dom()
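
If you want to see what such a capture could look like without the library, here is a minimal sketch using Playwright's sync API. The helper name `snapshot_page` and the returned dictionary keys are assumptions for illustration. It uses `page.content()` (serialized HTML) as a stand-in for the DOM snapshot, which may differ from what the library's `computer.dom()` actually returns.

```python
import base64


def snapshot_page(page) -> dict:
    """Return a base64-encoded PNG screenshot and the serialized HTML of a page.

    `page` is expected to be a Playwright sync-API Page (or any object with
    compatible `screenshot()` and `content()` methods).
    """
    png_bytes = page.screenshot()  # Playwright returns PNG bytes by default
    html = page.content()          # full HTML serialization of the current DOM
    return {
        "screenshot_base64": base64.b64encode(png_bytes).decode("ascii"),
        "dom": html,
    }


# Usage with Playwright (requires `pip install playwright` and installed browsers):
#
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page(viewport={"width": 1024, "height": 768})
#     page.goto("https://www.lumiteh.com/")
#     state = snapshot_page(page)
#     browser.close()
```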

Integrating the BUA loop

1. Send a request to the model

The first request contains the initial state of the environment: a screenshot of the page and the page's DOM.

from lumiteh import LumiTehClient

lumiteh = LumiTehClient()

# For reference only: this request will not work as written
response = lumiteh.bua.responses.create(
    params=[{
        "display_width": 1024,
        "display_height": 768,
    }],
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Check the latest job offers on the careers page of lumiteh.io."
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}"
                },
                {
                    "type": "input_dom_json",
                    # Placeholder: replace with the JSON-serialized DOM tree of the page
                    "dom_tree": "{<DOM_TREE>}"
                }
            ]
        }
    ],
)

print(response.output)
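
The `input` payload above can be assembled from raw values with a small helper. This is a sketch under assumptions: the function name `build_first_input` is hypothetical, and it assumes the `dom_tree` field accepts a JSON string, as the placeholder in the request suggests.

```python
import base64
import json


def build_first_input(prompt: str, screenshot_png: bytes, dom_tree: dict) -> list:
    """Build the `input` list for the first request from raw values."""
    screenshot_base64 = base64.b64encode(screenshot_png).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
                # Assumption: the DOM tree is sent as a JSON-encoded string.
                {"type": "input_dom_json", "dom_tree": json.dumps(dom_tree)},
            ],
        }
    ]
```

The returned list can then be passed as the `input` argument of the request shown above.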

2. Receive a suggested action

The response will provide a sequence of actions to help achieve the specified goal. These actions may include clicking at a specific position, entering text, scrolling, or pausing as needed.

{
  "type": "browser_call",
  "id": "9e59fa10-9261-4c8b-a89a-7bfbeae26eda",
  "call_id": "f8c96d4a-d424-4047-9e8b-4d83d292e749",
  "state": {
    "previous_goal_status": "unknown",
    "previous_goal_eval": "I have successfully navigated to the website lumiteh.io.",
    "page_summary": "The page is the homepage of LumiTeh, a web agent framework. It has links to Product, Use Cases, Pricing, Docs, and Careers. It also has buttons to Sign Up, Get started for free, and Book a demo.",
    "relevant_interactions": [
      {
        "id": "L5",
        "reason": "The link L5 leads to the careers page, where I can find the jobs offered by LumiTeh."
      }
    ],
    "memory": "Navigated to lumiteh.io",
    "next_goal": "Find the jobs offered by LumiTeh."
  },
  "action": {
    "id": "L5",
    "selectors": {
      "css_selector": "html > body > div > header > div > div > div:nth-of-type(2) > nav > a:nth-of-type(4).text-base.font-normal.text-muted-foreground.transition-colors[target=\"_blank\"][href=\"https://lumiteh.notion.site/jobs-for-humans\"]",
      "xpath_selector": "html/body/div/header/div/div/div[2]/nav/a[4]",
      "lumiteh_selector": "",
      "in_iframe": false,
      "in_shadow_root": false,
      "iframe_parent_css_selectors": [],
      "playwright_selector": null
    },
    "type": "click"
  }
}

How you map a browser call to actions in code depends on your environment. If you use Playwright as your browser automation library, we already provide a library that maps browser calls to Playwright actions:

bua-playwright-agent (GitHub)
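
To give a rough idea of what such a mapping involves, here is a minimal sketch that executes the `click` browser call shown above against a Playwright page. It is not the bua-playwright-agent implementation. The helper names and the selector preference order (`playwright_selector`, then `css_selector`, then `xpath_selector`) are assumptions, and only the `click` action type from the example response is handled.

```python
def pick_selector(selectors: dict) -> str:
    """Choose one selector string from a browser call's `selectors` object."""
    if selectors.get("playwright_selector"):
        return selectors["playwright_selector"]
    if selectors.get("css_selector"):
        return selectors["css_selector"]
    # Playwright only auto-detects XPath that starts with "//" or "..",
    # so the "xpath=" prefix is added explicitly.
    return "xpath=" + selectors["xpath_selector"]


def execute_browser_call(page, call: dict) -> None:
    """Execute a browser call on a Playwright sync-API Page (click only)."""
    action = call["action"]
    selectors = action["selectors"]
    scope = page
    # Descend through any parent iframes before locating the target element.
    for frame_css in selectors.get("iframe_parent_css_selectors", []):
        scope = scope.frame_locator(frame_css)
    target = scope.locator(pick_selector(selectors))
    if action["type"] == "click":
        target.click()
    else:
        raise NotImplementedError(f"unsupported action type: {action['type']}")
```

Other action types (entering text, scrolling, waiting) would need their own branches, whose payload fields are not shown in the example response above.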

