async def extract_structured_data(
    *,
    content: list[ContentItem] | ContentItem,
    data_schema: type[BaseModel] | dict[str, Any],
    prompt: str | None = None,
    max_retries: int = 3,
    enable_cache: bool = True,
    model: SUPPORTED_MODELS = 'claude-3-5-haiku-latest',
    api_key: str | None = None,
) -> Any
Extracts structured data from content items (text, images) using AI-powered analysis.

This overload provides a simplified interface for extracting data from various content types without requiring a page source or extraction strategy. It accepts text content, image buffers, or image URLs and extracts structured data according to the provided schema.

Examples

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data, TextContentItem

# Define the expected output structure with a Pydantic model
class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age")
    occupation: str = Field(description="Person's job title")
    company: str = Field(description="Company name")

async def automation(page, params, **_kwargs):
    # A single text content item; image buffers and image URLs work the same way
    text_content: TextContentItem = {
        "type": "text",
        "data": "John Doe, age 30, works as a Software Engineer at Tech Corp"
    }
    person = await extract_structured_data(
        content=text_content,
        model="gpt-4o",
        data_schema=Person,  # Pass the Pydantic model class directly
        prompt="Extract person information from the text"
    )
    print(f"Found person: {person['name']}, {person['age']} years old")

Arguments

content
list[ContentItem] | ContentItem
required
Content to extract data from - a single content item or a list of content items. See ContentItem for details.
data_schema
type[BaseModel] | dict[str, Any]
required
Schema defining the expected structure of the extracted data. Can be either a Pydantic BaseModel class or a JSON Schema dictionary.
prompt
str
default:"None"
Optional prompt to guide the extraction process and provide more context. Defaults to None.
max_retries
int
default:"3"
Maximum number of retry attempts on failure. Failures include validation errors, API errors, malformed output, and so on. Defaults to 3.
enable_cache
bool
default:"True"
Whether to enable caching of the extracted data. Defaults to True.
model
SUPPORTED_MODELS
default:"'claude-3-5-haiku-latest'"
AI model to use for extraction. See SUPPORTED_MODELS for all supported models. Defaults to 'claude-3-5-haiku-latest'.
api_key
str
default:"None"
Optional API key to use for AI extraction. If provided, usage is billed to that key rather than to your account. Defaults to None. A short usage sketch follows this list.
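Caching and billing can be tuned per call. A minimal sketch, assuming params carries your own provider key (the params.get("api_key") lookup is illustrative, not part of the API):

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data

class Person(BaseModel):
    name: str = Field(description="Person's full name")

async def automation(page, params, **_kwargs):
    person = await extract_structured_data(
        content={"type": "text", "data": "John Doe, age 30, Software Engineer at Tech Corp"},
        data_schema=Person,
        enable_cache=False,  # skip the content-hash cache and force a fresh extraction
        api_key=params.get("api_key"),  # if set, usage is not billed to your account
    )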

Returns: Any

The extracted structured data conforming to the provided schema.
Key Features & Limitations
  • No DOM Matching: This overload does not support DOM matching since it doesn’t operate on web pages.
  • Smart Caching: Caching is based on content hash to avoid redundant API calls.
  • Automatic Image Fetching: Image URLs are automatically fetched and converted to image buffers for processing.
  • Batch Processing: Multiple content items can be processed together for a single, comprehensive extraction (see the sketch after this list).
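A sketch of a batch extraction that mixes text with an image URL. The image item's field names below are an assumption patterned after TextContentItem; check ContentItem for the actual shape:

from pydantic import BaseModel, Field
from intuned_browser.ai import extract_structured_data

class Listing(BaseModel):
    title: str = Field(description="Listing title")
    price: str = Field(description="Listed price")

async def automation(page, params, **_kwargs):
    items = [
        {"type": "text", "data": "Cozy 2BR apartment in Midtown, $1,850/month"},
        # Hypothetical image-item shape (see ContentItem for the real fields).
        # The URL is fetched and converted to an image buffer automatically.
        {"type": "image_url", "data": "https://example.com/listing-photo.jpg"},
    ]
    listing = await extract_structured_data(
        content=items,  # all items are analyzed together in one extraction
        data_schema=Listing,
        prompt="Extract the listing details from the text and the photo",
    )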