LumiTeh BUA
https://x.com/LumiTeh / https://www.lumiteh.com/ / https://github.com/LumiTeh-hub

Overview
Browser-Using Agents (BUA) follow the Computer-Using Agent (CUA) model popularized by OpenAI, but are extended to operate specifically in browser environments.
Traditional CUA models combine the vision and reasoning capabilities of LLMs to simulate control over computer interfaces and perform tasks. Browser-Using Agents focus solely on the browser as the primary interface, as giving AI agents access to the page DOM can significantly enhance their performance.
BUA is accessible via the bua/completions
endpoint.
How it works?
In terms of input, in addition to the traditional CUA Screenshot + Prompt method, BUA also utilizes the page’s DOM to enhance understanding and reasoning of web content. This is illustrated in the figure below.
Setting up your environment
Before you can use BUA, you require a browser environment that can capture screenshots and DOM snapshots of a given web page.
We advise using playwright
for this purpose.You can check out the library for an example implementation, in particular:
computer.screenshot()
computer.dom()
Integrating the BUA loop
1. Send a request to the model
The first request will contain the initial state of the environment, which is a screenshot of the page and the DOM of the page
from lumiteh import LumiTehClient
lumiteh = LumiTehClient()
# only for reference, this response will not work
response = lumiteh.bua.responses.create(
params=[{
"display_width": 1024,
"display_height": 768,
}],
input=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Check the latest job offers on the careers page of lumiteh.io."
},
{
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshot_base64}"
},
{
"type": "input_dom_json",
"dom_tree": f"\{<DOM_TREE>\}"
}
]
}
],
)
print(response.output)
2. Receive a suggested action
The response will provide a sequence of actions to help achieve the specified goal. These actions may include clicking at a specific position, entering text, scrolling, or pausing as needed.
{
"type": "browser_call",
"id": "9e59fa10-9261-4c8b-a89a-7bfbeae26eda",
"call_id": "f8c96d4a-d424-4047-9e8b-4d83d292e749",
"state": {
"previous_goal_status": "unknown",
"previous_goal_eval": "I have successfully navigated to the website lumiteh.io.",
"page_summary": "The page is the homepage of LumiTeh, a web agent framework. It has links to Product, Use Cases, Pricing, Docs, and Careers. It also has buttons to Sign Up, Get started for free, and Book a demo.",
"relevant_interactions": [
{
"id": "L5",
"reason": "The link L5 leads to the careers page, where I can find the jobs offered by LumiTeh."
}
],
"memory": "Navigated to lumiteh.io",
"next_goal": "Find the jobs offered by LumiTeh."
},
"action": {
"id": "L5",
"selectors": {
"css_selector": "html > body > div > header > div > div > div:nth-of-type(2) > nav > a:nth-of-type(4).text-base.font-normal.text-muted-foreground.transition-colors[target=\"_blank\"][href=\"https://lumiteh.notion.site/jobs-for-humans\"]",
"xpath_selector": "html/body/div/header/div/div/div[2]/nav/a[4]",
"lumiteh_selector": "",
"in_iframe": false,
"in_shadow_root": false,
"iframe_parent_css_selectors": [],
"playwright_selector": null
},
"type": "click"
}
}
How you map a browser call to actions through code depends on your environment. If you are using playwright
as your browser automation library, we already have a library that maps the browser calls to playwright actions:
bua-playwright-agent. (GITHUB)
Last updated