# Advanced Scraping

Scrape any page and get formatted data

The Scrape API allows you to get the data you want from web pages using a single call. You can scrape page content and capture its data in various formats.

## ​Basic Markdown Scraping <a href="#basic-markdown-scraping" id="basic-markdown-scraping"></a>

The easiest way to scrape a webpage is by extracting its content in Markdown format. This is useful for preserving the page’s structure and formatting.

```python
from lumiteh_sdk import LumiTehClient

client = LumiTehClient()
markdown = client.scrape(
    url="https://www.lumiteh.io",
    only_main_content=True,
)
print(markdown)

```

## Structured Data Extraction <a href="#structured-data-extraction" id="structured-data-extraction"></a>

For more advanced use cases, you can extract structured data from web pages by defining a schema with Pydantic models. This approach is especially useful when you need to capture specific information, such as product details, pricing plans, or article metadata.

### **​Example: Extracting Pricing Plans**

Suppose you want to extract pricing information from a website. First, define your data models, then use them to extract structured data:

```python
from pydantic import BaseModel
from lumiteh_sdk import LumiTehClient

class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]

class PricingPlans(BaseModel):
    plans: list[PricingPlan]

client = LumiTehClient()
data = client.scrape(
    url="https://www.lumiteh.io",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans
)

# plans is a PricingPlans instance
# > note that the following line can raise an exception
# in case of a scraping error
plans = data.get()

```

## Agent Scraping <a href="#agent-scraping" id="agent-scraping"></a>

Agent Scraping is a more powerful way to scrape web pages. It allows you to navigate through the page, fill forms, and extract data from dynamic content.

```python
from pydantic import BaseModel
from lumiteh_sdk import LumiTehClient

class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]

client = LumiTehClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, login with the credentials and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation
    )
print(response.answer)

```

## Topics & Tips <a href="#topics-26-tips" id="topics-26-tips"></a>

### ​Scrape API vs Agent Scrape <a href="#scrape-api-vs-agent-scrape" id="scrape-api-vs-agent-scrape"></a>

{% columns %}
{% column %}

<p align="center"><strong>Scrape API</strong></p>

<p align="center"><em>Perfect for</em></p>

<p align="center"><strong>1.</strong> One-off scraping tasks</p>

<p align="center"><strong>2.</strong> Simple data extraction</p>

<p align="center"><strong>3.</strong> Static content</p>
{% endcolumn %}

{% column %}

<h4 align="center">Agent Scrape</h4>

<p align="center"><em>Perfect for</em></p>

<p align="center"><strong>1.</strong> Authentication or login flows</p>

<p align="center"><strong>2.</strong> Form filling and submission</p>

<p align="center"><strong>3.</strong> Dynamic content</p>
{% endcolumn %}
{% endcolumns %}

### Response Format Best Practices <a href="#response-format-best-practices" id="response-format-best-practices"></a>

{% hint style="success" %}
Use `response_format` whenever possible to yield the best & most reliable results:
{% endhint %}

**Tips for designing schemas:**

* Try a few different schemas to find what works best
* If you ask for a `company_name` field but there is no `company_name` on the page, LLM scraping will fail
* Design your schema carefully based on the actual content structure
* Response format is available for both `scrape` and `agent.run`

**Example of good schema design:**

```python
from pydantic import BaseModel

class Product(BaseModel):
    product_url: str
    name: str
    price: float | None = None
    description: str | None = None
    image_url: str | None = None

class ProductList(BaseModel):
    products: list[Product]

```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.lumiteh.com/guides/advanced-scraping.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
