Advanced Scraping

Scrape any page and get formatted data

The Scrape API extracts the data you need from a web page in a single call. It can return the page content in several formats, such as Markdown or structured data that follows a schema you define.

Basic Markdown Scraping

The easiest way to scrape a webpage is by extracting its content in Markdown format. This is useful for preserving the page’s structure and formatting.

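from lumiteh_sdk import LumiTehClient

client = LumiTehClient()

# Scrape the page and get its content back as Markdown.
markdown = client.scrape(
    url="https://www.lumiteh.io",
    only_main_content=True,  # restrict the output to the page's main content
)
print(markdown)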

Structured Data Extraction

For more advanced use cases, you can extract structured data from web pages by defining a schema with Pydantic models. This approach is especially useful when you need to capture specific information, such as product details, pricing plans, or article metadata.

Example: Extracting Pricing Plans

Suppose you want to extract pricing information from a website. First, define your data models, then use them to extract structured data:
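
A minimal sketch of these two steps, assuming Pydantic is installed. The model names (PricingPlan, PricingPlans), their fields, and the extract_pricing helper are illustrative and not taken from the SDK. The response_format keyword on client.scrape is an assumption as well; check the SDK reference for the actual signature and return type.

```python
from typing import List, Optional

from pydantic import BaseModel


class PricingPlan(BaseModel):
    """One plan on a pricing page (illustrative fields)."""

    name: str
    price_per_month: Optional[float] = None  # None when the page shows no price
    features: List[str] = []


class PricingPlans(BaseModel):
    """Top-level schema: every plan found on the page."""

    plans: List[PricingPlan]


def extract_pricing(client, url: str):
    """Ask the scraper for data shaped like PricingPlans.

    ASSUMPTION: client.scrape accepts a `response_format` keyword that takes
    a Pydantic model class. This is not a confirmed signature.
    """
    return client.scrape(url=url, response_format=PricingPlans)


# Offline check that the schema accepts the kind of data we expect back.
sample = {
    "plans": [
        {"name": "Free", "price_per_month": 0, "features": ["100 pages"]},
        {"name": "Enterprise"},  # no public price or feature list
    ]
}
parsed = PricingPlans(**sample)
print([plan.name for plan in parsed.plans])
```

Validating a hand-written sample like this before calling the service is a cheap way to confirm that the schema matches what the page actually shows.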

Agent Scraping

Agent Scraping is a more powerful way to scrape web pages. It allows you to navigate through the page, fill forms, and extract data from dynamic content.

Topics & Tips

Scrape API vs Agent Scrape

Scrape API

Perfect for

1. One-off scraping tasks

2. Simple data extraction

3. Static content

Agent Scrape

Perfect for

1. Authentication or login flows

2. Form filling and submission

3. Dynamic content

Response Format Best Practices

Tips for designing schemas:

  • Try a few different schemas to find what works best

  • If your schema requires a field that the page does not contain (for example, company_name), LLM scraping will fail

  • Design your schema carefully based on the actual content structure

  • Response format is available for both scrape and agent.run

Example of good schema design:
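
One possible illustration of the tips above, assuming Pydantic. The ArticleMetadata model and its fields are hypothetical. The idea is to require only what every target page reliably shows and to give everything else a default; whether the service treats optional fields as skippable is an assumption to verify against your own pages.

```python
from typing import Optional

from pydantic import BaseModel


class ArticleMetadata(BaseModel):
    # Shown on every page we plan to scrape, so it is required.
    title: str
    # May be missing on some pages, so these are optional with a default.
    author: Optional[str] = None
    company_name: Optional[str] = None


# Data from a page with no byline and no publisher still validates.
article = ArticleMetadata(title="Scraping 101")
print(article.title, article.author, article.company_name)
```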
