Advanced Scraping
Scrape any page and get formatted data
The Scrape API allows you to get the data you want from web pages using a single call. You can scrape page content and capture its data in various formats.
Basic Markdown Scraping
The easiest way to scrape a webpage is by extracting its content in Markdown format. This is useful for preserving the page’s structure and formatting.
```python
from lumiteh_sdk import LumiTehClient

client = LumiTehClient()

markdown = client.scrape(
    url="https://www.lumiteh.io",
    only_main_content=True,
)
print(markdown)
```
Structured Data Extraction
For more advanced use cases, you can extract structured data from web pages by defining a schema with Pydantic models. This approach is especially useful when you need to capture specific information, such as product details, pricing plans, or article metadata.
Example: Extracting Pricing Plans
Suppose you want to extract pricing information from a website. First, define your data models, then use them to extract structured data:
```python
from pydantic import BaseModel
from lumiteh_sdk import LumiTehClient

class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]

class PricingPlans(BaseModel):
    plans: list[PricingPlan]

client = LumiTehClient()

data = client.scrape(
    url="https://www.lumiteh.io",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans,
)

# plans is a PricingPlans instance.
# Note: the following line can raise an exception
# in case of a scraping error.
plans = data.get()
```
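Because `get()` raises on a scraping error, callers should be prepared to handle that failure. The SDK's actual result type is not shown in this document; the `ScrapeResult` and `ScrapeError` classes below are hypothetical stand-ins that only illustrate the raise-on-failure pattern:

```python
class ScrapeError(Exception):
    """Stand-in for whatever exception the SDK raises on failure."""

class ScrapeResult:
    """Minimal stand-in for a result wrapper whose .get() raises on failure."""
    def __init__(self, value=None, error=None):
        self._value, self._error = value, error

    def get(self):
        # Materialize the parsed result, or surface the scraping error.
        if self._error is not None:
            raise ScrapeError(self._error)
        return self._value

plans = ScrapeResult(value={"plans": []}).get()  # success path returns the data
try:
    ScrapeResult(error="page timed out").get()   # failure path raises
except ScrapeError as e:
    print("scrape failed:", e)
```

The same try/except structure applies around the real `data.get()` call above.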
Agent Scraping
Agent Scraping is a more powerful way to scrape web pages. It allows you to navigate through the page, fill forms, and extract data from dynamic content.
```python
from pydantic import BaseModel
from lumiteh_sdk import LumiTehClient

class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]

client = LumiTehClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, login with the credentials and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation,
    )
    print(response.answer)
```
Topics & Tips
Scrape API vs Agent Scrape
The Scrape API is perfect for:
1. One-off scraping tasks
2. Simple data extraction
3. Static content
For multi-step flows, such as navigating through pages, filling forms, or handling dynamic content, use Agent Scraping instead.
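The rule of thumb above can be sketched as a tiny helper (the predicate names are illustrative, not part of the SDK):

```python
def choose_scraper(needs_login: bool, needs_navigation: bool, dynamic_content: bool) -> str:
    """Pick the simplest tool that can do the job."""
    if needs_login or needs_navigation or dynamic_content:
        return "agent"   # Agent Scraping: multi-step, forms, dynamic pages
    return "scrape"      # Scrape API: one call, static content

print(choose_scraper(False, False, False))  # one-off static extraction
print(choose_scraper(True, False, False))   # login flow needs an agent
```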
Response Format Best Practices
Use `response_format` whenever possible to get the best and most reliable results.
Tips for designing schemas:
1. Try a few different schemas to find what works best
2. If you ask for a `company_name` field but there is no company name on the page, LLM scraping will fail
3. Design your schema carefully based on the actual content structure
4. Response format is available for both `scrape` and `agent.run`
Example of good schema design:
```python
from pydantic import BaseModel

class Product(BaseModel):
    product_url: str
    name: str
    price: float | None = None
    description: str | None = None
    image_url: str | None = None

class ProductList(BaseModel):
    products: list[Product]
```