Readers transform raw content into Document objects that can be chunked, embedded, and stored in your knowledge base. Each reader handles a specific format (PDF, CSV, Markdown, etc.) and extracts text and metadata.
from agno.knowledge.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True, chunk_size=5000)
documents = reader.read("company_handbook.pdf")

How Readers Work

  1. Parse: Read the raw content using format-specific logic
  2. Extract: Pull out text and metadata (page numbers, authors, etc.)
  3. Chunk: Split large content into smaller pieces (if enabled)
  4. Return: Provide a list of Document objects ready for embedding
# Output structure
Document(
    content="The extracted text...",
    id="unique_id",
    name="document_name",
    meta_data={"page": 1, "source": "handbook.pdf"},
)
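The four steps above can be sketched with a toy reader that uses only the standard library. `Doc` and `ToyTextReader` are hypothetical stand-ins for illustration, not Agno's `Document` or reader classes, and the chunk step assumes plain fixed-size slicing, which may differ from Agno's default strategy.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not Agno's class."""
    content: str
    name: str
    meta_data: dict = field(default_factory=dict)


class ToyTextReader:
    """Illustrative reader: parse -> extract -> chunk -> return."""

    def __init__(self, chunk: bool = True, chunk_size: int = 5000):
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, raw: bytes, name: str) -> list[Doc]:
        # 1. Parse: decode the raw bytes with format-specific logic.
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return []  # follow the "empty list on failure" convention
        # 2. Extract: collect metadata alongside the text.
        meta = {"source": name, "characters": len(text)}
        # 3. Chunk: split into fixed-size pieces if enabled.
        if self.chunk:
            pieces = [text[i:i + self.chunk_size]
                      for i in range(0, len(text), self.chunk_size)]
        else:
            pieces = [text]
        # 4. Return: one Doc per piece, tagged with its chunk index.
        return [Doc(content=piece, name=name, meta_data={**meta, "chunk": n})
                for n, piece in enumerate(pieces, start=1)]


docs = ToyTextReader(chunk_size=10).read(b"The quick brown fox jumps", "fox.txt")
for doc in docs:
    print(doc.meta_data["chunk"], repr(doc.content))
```

With a 25-character input and `chunk_size=10`, the sketch yields three documents that share the same source metadata and differ only in content and chunk index.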

Supported Readers

Reader                 Description
PDFReader              Extract text from PDF files
TextReader             Plain text files
MarkdownReader         Markdown files
CSVReader              CSV files (rows become documents)
FieldLabeledCSVReader  CSV rows as field-labeled text
JSONReader             JSON files
PPTXReader             PowerPoint presentations
ArxivReader            Academic papers from arXiv
WikipediaReader        Wikipedia articles
YouTubeReader          YouTube transcripts
WebsiteReader          Crawl websites recursively
WebSearchReader        Web search results
FirecrawlReader        Web scraping via Firecrawl API

Using Readers with Knowledge

Pass a reader to knowledge.insert() to override automatic format detection:
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader

# vector_db is assumed to be an already configured vector database instance
knowledge = Knowledge(vector_db=vector_db)

# Use custom reader configuration
reader = PDFReader(chunk_size=3000, split_on_pages=True)
knowledge.insert(path="documents/", reader=reader)

Auto-Selection

Agno automatically selects the right reader based on file extension or URL:
from agno.knowledge.reader.reader_factory import ReaderFactory

# By file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # CSVReader

# By URL
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader
When using knowledge.insert(), this happens automatically.
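To make the idea of extension- and URL-based selection concrete, here is a minimal sketch of a lookup-table dispatcher. It is not how `ReaderFactory` is implemented; the function `pick_reader`, both mapping tables, and the fallback choices are assumptions made for illustration, and it returns reader names as strings rather than reader instances.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical lookup tables. The reader names come from the table above,
# but the mappings themselves are assumptions for illustration.
EXTENSION_READERS = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".md": "MarkdownReader",
    ".json": "JSONReader",
    ".txt": "TextReader",
}
HOST_READERS = {
    "youtube.com": "YouTubeReader",
    "arxiv.org": "ArxivReader",
    "wikipedia.org": "WikipediaReader",
}


def pick_reader(source: str, default: str = "TextReader") -> str:
    """Return the name of a reader for a file path or an http(s) URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        for domain, reader in HOST_READERS.items():
            # Match the bare domain and any subdomain such as www. or en.
            if host == domain or host.endswith("." + domain):
                return reader
        return "WebsiteReader"  # assumed fallback for unrecognized hosts
    # Not a URL: dispatch on the (case-insensitive) file extension.
    return EXTENSION_READERS.get(Path(source).suffix.lower(), default)


print(pick_reader("report.PDF"))
print(pick_reader("https://www.youtube.com/watch?v=abc"))
```

Keeping the mapping in plain dictionaries makes the selection rule easy to inspect and extend; passing an explicit reader, as in the previous section, bypasses any such lookup.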

Configuration

Chunking

reader = PDFReader(
    chunk=True,           # Enable chunking (default: True)
    chunk_size=5000,      # Characters per chunk
)

Format-Specific Options

from agno.knowledge.reader.csv_reader import CSVReader
from agno.knowledge.reader.text_reader import TextReader

# PDF with encryption and OCR
reader = PDFReader(
    password="secret",    # Password for an encrypted PDF
    read_images=True,     # OCR for images
    split_on_pages=True,  # One document per page
)

# CSV with custom encoding
reader = CSVReader(
    encoding="latin-1",
)

# Text with encoding override
reader = TextReader(
    encoding="utf-8",
)

Runtime Options

Override settings when calling read():
documents = reader.read(
    "file.pdf",
    name="custom_document_name",  # Override default naming
    password="runtime_password",  # Password at read time
)

Async Processing

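All readers provide an async_read() method, so I/O-bound reads can run concurrently instead of one after another:
import asyncio

async def main():
    # Single file
    documents = await reader.async_read("file.pdf")

    # Batch processing: files is a list of paths; the reads run concurrently
    tasks = [reader.async_read(file) for file in files]
    all_documents = await asyncio.gather(*tasks)  # one list of documents per file

asyncio.run(main())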
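For large batches it can help to cap how many reads are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore`. `fake_async_read` is a hypothetical stand-in for `reader.async_read()` so the example runs without Agno installed, and the `limit` parameter is an assumption of this sketch, not a reader option.

```python
import asyncio


async def fake_async_read(path: str) -> list[str]:
    """Hypothetical stand-in for reader.async_read(); simulates I/O latency."""
    await asyncio.sleep(0.01)
    return [f"document from {path}"]


async def read_all(paths: list[str], limit: int = 4) -> list[list[str]]:
    """Read all paths concurrently, with at most `limit` reads in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path: str) -> list[str]:
        # Wait for a free slot before starting the read.
        async with semaphore:
            return await fake_async_read(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(path) for path in paths))


batch = asyncio.run(read_all(["a.pdf", "b.pdf", "c.pdf"], limit=2))
print(batch)
```

Because `asyncio.gather()` preserves input order, each entry in `batch` lines up with the path at the same index.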

Custom Chunking Strategy

Override the default chunking behavior:
from agno.knowledge.chunking.semantic_chunking import SemanticChunking

reader = PDFReader(
    chunk=True,
    chunking_strategy=SemanticChunking(),
)
See Chunking for available strategies.
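The idea behind a pluggable chunking strategy can be illustrated without Agno: the reader delegates the splitting decision to an object with a single method. The `Strategy` protocol, the two strategy classes, and `split` below are hypothetical and much simpler than semantic chunking; they split on paragraph and sentence boundaries only.

```python
import re
from typing import Protocol


class Strategy(Protocol):
    """Hypothetical interface: turn one text into a list of chunks."""

    def chunk(self, text: str) -> list[str]: ...


class ByParagraph:
    """Split on blank lines so chunks follow the author's paragraph breaks."""

    def chunk(self, text: str) -> list[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]


class BySentence:
    """Split after sentence-ending punctuation followed by whitespace."""

    def chunk(self, text: str) -> list[str]:
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [part for part in parts if part]


def split(text: str, strategy: Strategy) -> list[str]:
    # The caller picks the strategy; the splitting code stays unchanged.
    return strategy.chunk(text)


sample = "Intro line.\n\nSecond paragraph. It has two sentences."
print(split(sample, ByParagraph()))
print(split(sample, BySentence()))
```

Swapping the strategy object changes how the same text is divided, which is the role the `chunking_strategy` argument plays in the example above.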

Error Handling

Readers return an empty list when processing fails. Check logs for debugging information:
documents = reader.read("corrupted.pdf")
if not documents:
    print("Failed to read file, check logs for details")
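When ingesting many files, the empty-list convention makes it straightforward to separate successes from failures in one pass. The sketch below assumes only that a read call returns a list; `read_many` and `stub_read` are hypothetical helpers, with `stub_read` standing in for `reader.read` so the example runs on its own.

```python
import logging
from typing import Callable

logger = logging.getLogger("ingest")


def read_many(paths: list[str], read: Callable[[str], list]) -> tuple[dict, list]:
    """Return (documents by path, paths that produced no documents)."""
    loaded: dict[str, list] = {}
    failed: list[str] = []
    for path in paths:
        documents = read(path)
        if documents:
            loaded[path] = documents
        else:
            # An empty list signals a failed read (or a file with no content).
            logger.warning("No documents extracted from %s", path)
            failed.append(path)
    return loaded, failed


def stub_read(path: str) -> list[str]:
    """Hypothetical stand-in for reader.read()."""
    return [] if "corrupted" in path else [f"text of {path}"]


loaded, failed = read_many(["handbook.pdf", "corrupted.pdf"], stub_read)
print(loaded)
print(failed)
```

Collecting the failed paths lets the caller retry them or report them together after the batch finishes, rather than stopping at the first problem file.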

Next Steps