Advanced Scraping
Scrape any page and get formatted data
The Scrape API retrieves the data you want from a web page in a single call. It can return the page content as Markdown or as structured data that matches a schema you define.
Basic Markdown Scraping
The easiest way to scrape a webpage is by extracting its content in Markdown format. This is useful for preserving the page’s structure and formatting.
from lumiteh_sdk import LumiTehClient

client = LumiTehClient()

# Fetch the page and return its content as Markdown.
markdown = client.scrape(
    url="https://www.lumiteh.io",
    only_main_content=True,  # keep only the page's main content
)
print(markdown)
Structured Data Extraction
For more advanced use cases, you can extract structured data from web pages by defining a schema with Pydantic models. This approach is especially useful when you need to capture specific information, such as product details, pricing plans, or article metadata.
Example: Extracting Pricing Plans
Suppose you want to extract pricing information from a website. First, define your data models, then use them to extract structured data:
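A minimal sketch of what such models could look like. The field names (name, price_per_month, features) and the sample payload are made up for illustration. The extract_pricing helper is hypothetical: it assumes that scrape accepts a Pydantic model through the response_format parameter described under Response Format Best Practices, and it makes no claim about the shape of the returned object, so check the SDK reference before relying on it.

```python
from typing import List, Optional

from pydantic import BaseModel


class PricingPlan(BaseModel):
    # Field names are illustrative assumptions; match them to the target page.
    name: str
    price_per_month: Optional[float] = None  # None for "contact us" tiers
    features: List[str] = []


class PricingPlans(BaseModel):
    plans: List[PricingPlan]


def extract_pricing(url: str):
    """Hypothetical helper: assumes scrape() accepts response_format."""
    # Imported lazily so the models can be tested without the SDK installed.
    from lumiteh_sdk import LumiTehClient

    client = LumiTehClient()
    return client.scrape(url=url, response_format=PricingPlans)


# Offline check of the models against a hand-written payload (made-up data).
sample = {
    "plans": [
        {"name": "Free", "price_per_month": 0, "features": ["100 scrapes"]},
        {"name": "Enterprise", "features": ["SSO"]},
    ]
}
parsed = PricingPlans(**sample)
print([plan.name for plan in parsed.plans])
```

Validating a hand-written payload first is a cheap way to confirm the schema accepts the shapes you expect before spending a real scrape on it.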
Agent Scraping
Agent Scraping handles pages that a single scrape call cannot. An agent can navigate through the page, fill in forms, and extract data from dynamic content.
Topics & Tips
Scrape API vs Agent Scrape
Scrape API
Perfect for
1. One-off scraping tasks
2. Simple data extraction
3. Static content
Response Format Best Practices
Use response_format whenever possible. It yields the most reliable results.
Tips for designing schemas:
Try a few different schemas to find what works best
If you ask for a company_name field but there is no company_name on the page, LLM scraping will fail
Design your schema carefully based on the actual content structure
Response format is available for both scrape and agent.run
Example of good schema design:
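One way to act on the tips above, sketched with made-up field names: mark any field that may be absent from the page as Optional, and describe each field so its intent is clear. The failure shown here is plain Pydantic validation run locally, used as a stand-in for the company_name case. Whether the LumiTeh SDK forwards Field descriptions to the extractor is an assumption.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class StrictArticle(BaseModel):
    # Every field is required, so a page without a company name cannot satisfy it.
    title: str
    company_name: str


class TolerantArticle(BaseModel):
    title: str = Field(description="Headline of the article")
    # Optional with a default: a missing value becomes None instead of an error.
    company_name: Optional[str] = Field(
        default=None, description="Publisher name, if the page shows one"
    )


# Made-up extraction result for a page that shows no company name.
page_data = {"title": "Scraping 101"}

try:
    StrictArticle(**page_data)
    strict_ok = True
except ValidationError:
    strict_ok = False

tolerant = TolerantArticle(**page_data)
print(strict_ok, tolerant.company_name)
```

The strict model rejects the payload while the tolerant one accepts it, which is why optional fields are a safer default for content that varies from page to page.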