The Scrape API lets you extract the data you want from web pages with a single call. You can scrape page content and capture it in a variety of formats.
Basic Markdown Scraping
The easiest way to scrape a webpage is to extract its content as Markdown, which preserves the page's structure and formatting.
from lumiteh_sdk import LumiTehClient

client = LumiTehClient()

markdown = client.scrape(
    url="https://www.lumiteh.io",
    only_main_content=True,
)
print(markdown)
Structured Data Extraction
For more advanced use cases, you can extract structured data from web pages by defining a schema with Pydantic models. This approach is especially useful when you need to capture specific information, such as product details, pricing plans, or article metadata.
Example: Extracting Pricing Plans
Suppose you want to extract pricing information from a website. First, define your data models, then pass them to scrape as the response_format; the full pricing-plan example appears at the end of this page.
Agent Scraping
Agent Scraping is a more powerful way to scrape web pages: an agent can navigate through pages, fill and submit forms, and extract data from dynamic content. The LinkedIn example at the end of this page shows a full agent.run with a structured response_format.
Topics & Tips
Scrape API vs Agent Scrape
Scrape API is perfect for:
1. One-off scraping tasks
2. Simple data extraction
3. Static content

Agent Scrape is perfect for:
1. Authentication or login flows
2. Form filling and submission
3. Dynamic content
Response Format Best Practices
Use response_format whenever possible; it yields the best and most reliable results.
Tips for designing schemas:
1. Try a few different schemas to find what works best.
2. If you ask for a company_name field but there is no company_name on the page, LLM scraping will fail (see the sketch after this list).
3. Design your schema carefully based on the actual content structure.
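One way to make a schema tolerant of fields that may not be present is to declare them as optional, as the pricing-plan example below does with price_per_month. A small illustrative sketch; the Company model and its fields are hypothetical:

from pydantic import BaseModel


class Company(BaseModel):
    name: str
    # Hypothetical fields: declared optional so the schema does not demand
    # values that may not exist on the page.
    company_name: str | None = None
    tagline: str | None = None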
Response format is available for both scrape and agent.run; both of the full examples below use it.
Using scrape with a response_format to extract pricing plans:

from pydantic import BaseModel

from lumiteh_sdk import LumiTehClient


class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]


class PricingPlans(BaseModel):
    plans: list[PricingPlan]


client = LumiTehClient()

data = client.scrape(
    url="https://www.lumiteh.io",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans,
)

# data.get() returns a PricingPlans instance.
# Note that it can raise an exception in case of a scraping error.
plans = data.get()
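Because data.get() can raise on a scraping error, you may want to handle that case explicitly. A minimal sketch continuing the example above; the exact exception class raised by the SDK is not documented here, so a broad except is assumed:

# Continues the pricing-plan example: `data` is the result of client.scrape(...) above.
try:
    plans = data.get()  # PricingPlans instance on success
    for plan in plans.plans:
        print(plan.name, plan.price_per_month, plan.features)
except Exception as exc:  # assumption: the SDK signals scraping errors with an exception
    print(f"Scraping failed: {exc}")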
Using agent.run with a response_format to extract a LinkedIn conversation, with credentials stored in a Vault:

from pydantic import BaseModel

from lumiteh_sdk import LumiTehClient


class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]


client = LumiTehClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, login with the credentials and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation,
    )
    print(response.answer)