LumiTeh BUA

https://x.com/LumiTeh / https://www.lumiteh.com/ / https://github.com/LumiTeh-hub

Overview

Browser-Using Agents (BUA) follow the Computer-Using Agent (CUA) model popularized by OpenAI, but are extended to operate specifically in browser environments.

Traditional CUA models combine the vision and reasoning capabilities of LLMs to simulate control over computer interfaces and perform tasks. Browser-Using Agents focus solely on the browser as the primary interface, as giving AI agents access to the page DOM can significantly enhance their performance.

BUA is accessible via the bua/completions endpoint.

How it works?

In terms of input, in addition to the traditional CUA Screenshot + Prompt method, BUA also utilizes the page’s DOM to enhance understanding and reasoning of web content. This is illustrated in the figure below.

1

Send a request to `bua/completions`

Include the computer tool among the available tools, specifying its display size and environment. You can also provide a screenshot of the environment’s initial state in the first request.

2

Receive a response from the BUA model

The response will include a list of actions to help achieve the specified goal. These actions may involve clicking at a specific position, entering text, scrolling, or even waiting.

3

Execute the requested action

Execute through code the corresponding action on your browser environment.

4

Capture the updated state

After executing the action, capture the updated state of the environment as a screenshot.

5

Repeat

Send a new request with the updated state as a computer_call_output, and repeat this loop until the model stops requesting actions or you decide to stop.

Setting up your environment

Before you can use BUA, you require a browser environment that can capture screenshots and DOM snapshots of a given web page.

We advise using playwright for this purpose.You can check out the library for an example implementation, in particular:

  • computer.screenshot()

  • computer.dom()

Integrating the BUA loop

​1. Send a request to the model

The first request will contain the initial state of the environment, which is a screenshot of the page and the DOM of the page

2. Receive a suggested action

The response will provide a sequence of actions to help achieve the specified goal. These actions may include clicking at a specific position, entering text, scrolling, or pausing as needed.

How you map a browser call to actions through code depends on your environment. If you are using playwright as your browser automation library, we already have a library that maps the browser calls to playwright actions:

bua-playwright-agent. (GITHUB)

Last updated