async def extract_structured_data(
    *,
    content: list[ContentItem] | ContentItem,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None = None,
    max_retries: int = 3,
    enable_cache: bool = True,
    model: SUPPORTED_MODELS = 'claude-3-5-haiku-latest',
    api_key: str | None = None,
) -> Any
Extracts structured data from content items (text, images) using AI-powered analysis.

This overload provides a simplified interface for extracting data from various content types without requiring a page source or extraction strategy. It accepts text content, image buffers, or image URLs and extracts structured data according to the provided schema.

Examples

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data, TextContentItem

# Define the expected output structure with a Pydantic model
class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age")
    occupation: str = Field(description="Person's job title")
    company: str = Field(description="Company name")

async def automation(page, params, **_kwargs):
    # A single text content item; image buffers and image URLs work the same way
    text_content: TextContentItem = {
        "type": "text",
        "data": "John Doe, age 30, works as a Software Engineer at Tech Corp"
    }
    person = await extract_structured_data(
        content=text_content,
        model="gpt-4o",
        data_schema=Person,  # Pass the Pydantic model class directly
        prompt="Extract person information from the text"
    )
    print(f"Found person: {person['name']}, {person['age']} years old")

Arguments

content
list[ContentItem] | ContentItem
required
Content to extract data from - a single content item or a list of content items. See ContentItem for details.
data_schema
type[BaseModel] | dict[str, Any]
required
Schema defining the expected structure of the extracted data. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.
prompt
str
default:"None"
Optional prompt to guide the extraction process and provide more context. Defaults to None.
max_retries
int
default:"3"
Maximum number of retry attempts on failure. Failures include validation errors, API errors, malformed output, and so on. Defaults to 3.
enable_cache
bool
default:"True"
Whether to enable caching of the extracted data. Defaults to True.
model
SUPPORTED_MODELS
default:"'claude-3-5-haiku-latest'"
AI model to use for extraction. See SUPPORTED_MODELS for all supported models. Defaults to 'claude-3-5-haiku-latest'.
api_key
str
default:"None"
Optional API key to use for AI extraction. If provided, usage is billed to that key rather than to your account. Defaults to None. A short usage sketch follows this list.
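Caching and billing can be tuned per call. A minimal sketch, assuming params carries your own provider key (the params.get("api_key") lookup is illustrative, not part of the API):

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data

class Person(BaseModel):
    name: str = Field(description="Person's full name")

async def automation(page, params, **_kwargs):
    person = await extract_structured_data(
        content={"type": "text", "data": "John Doe, age 30, Software Engineer at Tech Corp"},
        data_schema=Person,
        enable_cache=False,  # skip the content-hash cache and force a fresh extraction
        api_key=params.get("api_key"),  # if set, usage is not billed to your account
    )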

Returns: Any

The extracted structured data conforming to the provided schema.
Key Features & Limitations
  • No DOM Matching: This overload does not support DOM matching since it doesn’t operate on web pages.
  • Smart Caching: Caching is based on content hash to avoid redundant API calls.
  • Automatic Image Fetching: Image URLs are automatically fetched and converted to image buffers for processing.
  • Batch Processing: Multiple content items can be processed together for a single, comprehensive extraction (see the sketch after this list).
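A sketch of a batch extraction that mixes text with an image URL. The image item's field names below are an assumption patterned after TextContentItem; check ContentItem for the actual shape:

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data

class Listing(BaseModel):
    title: str = Field(description="Listing title")
    price: str = Field(description="Listed price")

async def automation(page, params, **_kwargs):
    items = [
        {"type": "text", "data": "Cozy 2BR apartment in Midtown, $1,850/month"},
        # Hypothetical image-item shape (see ContentItem for the real fields).
        # The URL is fetched and converted to an image buffer automatically.
        {"type": "image_url", "data": "https://example.com/listing-photo.jpg"},
    ]
    listing = await extract_structured_data(
        content=items,  # all items are analyzed together in one extraction
        data_schema=Listing,
        prompt="Extract the listing details from the text and the photo",
    )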