How to Build a PDF RAG Pipeline Without Text Extraction (Using Native PDF Embeddings)

Build a RAG pipeline that embeds raw PDF bytes with Gemini Embedding 2. No text extraction for embedding, vision-model OCR for the text you need, and better retrieval than extract-then-embed.

Mar 18, 2026

By
Austin Vance

Every RAG pipeline that handles PDFs does the same thing: extract text, chunk it, embed the chunks. The entire pipeline assumes the text is the document. But text is often only a small part of the full story.

A court filing has headers, case numbers in specific positions, indented paragraphs that signal legal structure, signature blocks. A financial report has columns, footnotes, charts with captions. When you extract the text, you throw all that away. The embedding model sees a flat string. The document's visual structure, the thing people use to understand it, is stripped.

Google's Gemini Embedding 2 addresses this. You send it the raw PDF bytes, and it reads the document the way a human would: layout, formatting, tables, everything becomes part of the embedding vector. Instead of extracting and chunking text, the embedding model sees the full document as you would.

We rebuilt a legal RAG pipeline around this: 215 scanned court documents, zero extractable text (every page is a scanned image), and retrieval quality that beats the extract-then-embed baseline. This post shows how to build your own RAG pipeline with LangGraph and Gemini Embedding 2.

The Ingestion Math

Traditional PDF ingestion is a chain of lossy steps:

| Step | What You Lose | Latency |
| --- | --- | --- |
| PDF → text extraction | Layout, formatting, tables | ~2 s/page |
| Text → character chunking | Cross-page context, sections | ~0.01 s |
| Chunks → text embedding (OpenAI) | Nothing, but the input is already degraded | ~0.1 s/chunk |
| Total (20-page doc, ~15 chunks) | | ~42 s |

Native PDF ingestion has two steps:

| Step | What You Lose | Latency |
| --- | --- | --- |
| PDF → split into ≤6-page groups | Nothing; pages are natural boundaries | ~0.01 s |
| Page group → Gemini Embedding 2 | Nothing; the model sees the actual PDF | ~1 s/chunk |
| Total (20-page doc, 4 chunks) | | ~4 s |

That's roughly 10x faster, with fewer API calls, no paid parsing service, and an embedding model that sees more information. The main tradeoff: minimum chunk granularity is one page instead of 500 characters. For documents with visual structure, which is most real-world documents, that's a trade worth making.
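As a rough back-of-envelope check, the per-step figures in the tables above can be turned into a latency model (the helper names are ours, not part of the pipeline):

```python
import math

MAX_PAGES_PER_CHUNK = 6  # Gemini Embedding 2's per-request page limit


def traditional_latency_s(pages: int, chunks: int) -> float:
    """Extract-then-embed: ~2 s/page extraction, ~0.01 s chunking, ~0.1 s/chunk embedding."""
    return pages * 2.0 + 0.01 + chunks * 0.1


def native_latency_s(pages: int) -> float:
    """Native ingestion: ~0.01 s split, ~1 s per <=6-page embedding group."""
    return 0.01 + math.ceil(pages / MAX_PAGES_PER_CHUNK) * 1.0
```

For the 20-page, 15-chunk example this gives roughly 41.5 s versus ~4 s, matching the table totals.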

The Architecture

PDF ──► pymupdf splits into ≤6-page groups
              │
              ├──► Gemini Embedding 2 (raw PDF bytes → vector)
              │         Sees layout, tables, formatting natively
              │
              └──► Gemini 2.5 Pro OCR (raw PDF bytes → text)
                        For full-text search + LLM context
                              │
                              ▼
                     In-Memory Vector Store
                     (production: pgvector + tsvector)
                              │
User Query ──► Embed Query ──► Vector Search ──► Grade ──► Generate
                                    ▲                │
                                    │          (not relevant)
                                    └── Rewrite ◄────┘


The key insight: two models, same PDF bytes, different purposes. Gemini Embedding 2 produces the vector (capturing visual semantics). Gemini 2.5 Pro produces the text (for keyword search and LLM context). Neither requires text extraction or image rendering; both accept raw PDF bytes.

State: Track Retrieval Quality

The state follows a corrective RAG pattern: retrieve, grade, rewrite if needed, and generate with citations.

from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class RAGState(TypedDict):
    question: str
    rewritten_query: str
    documents: list[dict]
    relevance_score: float
    rewrite_count: int
    max_rewrites: int
    answer: str
    citations: list[dict]
    confidence: str


PDF Splitting: Pages as Natural Chunks

Gemini Embedding 2 accepts up to 6 PDF pages per request. For longer documents, split into page groups using pymupdf. That's the only thing pymupdf does here: it's a splitter, not a text extractor.

from pathlib import Path

import fitz  # pymupdf

MAX_PAGES_PER_CHUNK = 6


def split_pdf(filepath: Path) -> list[dict]:
    """Split a PDF into groups of up to 6 pages. Returns raw PDF bytes per group."""
    doc = fitz.open(filepath)
    chunks = []
    for start in range(0, len(doc), MAX_PAGES_PER_CHUNK):
        end = min(start + MAX_PAGES_PER_CHUNK, len(doc))
        sub_doc = fitz.open()
        sub_doc.insert_pdf(doc, from_page=start, to_page=end - 1)
        chunks.append({
            "pdf_bytes": sub_doc.tobytes(),
            "page_numbers": list(range(start + 1, end + 1)),
            "source": filepath.name,
        })
        sub_doc.close()
    doc.close()
    return chunks


A 20-page document becomes 4 chunks; a 3-page document is 1 chunk. Each chunk preserves cross-page context: a sentence spanning pages 5-6 is captured within a single embedding. Compare that to text chunking at 500 characters, where the same sentence gets split at an arbitrary boundary.
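The grouping arithmetic can be sketched independently of pymupdf (chunk_page_ranges is a hypothetical helper mirroring the ranges split_pdf produces):

```python
MAX_PAGES_PER_CHUNK = 6


def chunk_page_ranges(total_pages: int) -> list[tuple[int, int]]:
    """1-indexed inclusive (first_page, last_page) ranges, as split_pdf groups them."""
    return [
        (start + 1, min(start + MAX_PAGES_PER_CHUNK, total_pages))
        for start in range(0, total_pages, MAX_PAGES_PER_CHUNK)
    ]
```

chunk_page_ranges(20) yields (1, 6), (7, 12), (13, 18), (19, 20): four chunks, with the page 19-20 context intact.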

Native PDF Embedding

This is the core innovation. Send raw PDF bytes to Gemini Embedding 2. The model natively understands document layout, headers, tables, formatting, visual hierarchy, and produces an embedding vector optimized for retrieval.

import math

from google import genai
from google.genai import types

gemini_client = genai.Client()

EMBEDDING_MODEL = "gemini-embedding-2-preview"
EMBEDDING_DIMS = 1536


def normalize(embedding: list[float]) -> list[float]:
    """L2-normalize an embedding vector (required for Gemini at non-default dims)."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return [x / norm for x in embedding] if norm > 0 else embedding


@traceable(name="embed_pdf_chunk", run_type="embedding")
def embed_pdf_chunk(pdf_bytes: bytes) -> list[float]:
    """Embed raw PDF bytes using Gemini Embedding 2. No text extraction."""
    result = gemini_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=types.Content(
            parts=[types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")]
        ),
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT",
            output_dimensionality=EMBEDDING_DIMS,
        ),
    )
    return normalize(list(result.embeddings[0].values))


That's it. No RecursiveCharacterTextSplitter. No OpenAIEmbeddings. The PDF goes in, a vector comes out.

Asymmetric task types matter. Documents are embedded with RETRIEVAL_DOCUMENT, queries with RETRIEVAL_QUERY. Gemini Embedding 2 optimizes the embedding space differently for each: query vectors are tuned to find relevant documents, not to be similar to them. This is a meaningful retrieval-quality improvement over symmetric embedding.


OCR_MODEL = "gemini-2.5-pro"


@traceable(name="ocr_pdf_chunk", run_type="chain")
def ocr_pdf_chunk(pdf_bytes: bytes) -> str:
    """OCR a PDF chunk by sending raw bytes to Gemini. No image rendering."""
    response = gemini_client.models.generate_content(
        model=OCR_MODEL,
        contents=[
            types.Content(parts=[
                types.Part.from_text(
                    text=(
                        "Transcribe all visible text exactly as it appears. "
                        "Preserve paragraph structure, headings, and formatting. "
                        "Return ONLY the transcribed text, nothing else."
                    )
                ),
                types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            ])
        ],
        config=types.GenerateContentConfig(temperature=0, max_output_tokens=8192),
    )
    return response.text.strip()


No page.get_pixmap(). No base64.b64encode(). No rendering pages to PNG images. The model reads the PDF directly, which eliminates the memory overhead that causes OOM crashes on large documents: rendering a 67-page scanned PDF to 2x-resolution PNGs requires gigabytes of RAM, while sending the raw bytes requires kilobytes.

Ingestion: Two Models, Same Bytes

The complete ingestion function processes each PDF through both models in sequence:


DOCUMENT_STORE: list[dict] = []  # In-memory; use pgvector in production


@traceable(name="ingest_pdf", run_type="chain")
def ingest_pdf(filepath: Path) -> int:
    """Ingest a PDF: split pages, embed raw bytes, OCR for text."""
    page_chunks = split_pdf(filepath)
    for chunk in page_chunks:
        embedding = embed_pdf_chunk(chunk["pdf_bytes"])
        text = ocr_pdf_chunk(chunk["pdf_bytes"])
        DOCUMENT_STORE.append({
            "content": text,
            "embedding": embedding,
            "source": chunk["source"],
            "page_numbers": chunk["page_numbers"],
        })
    return len(page_chunks)


Each chunk gets two things from the same PDF bytes: a vector that captures visual document semantics (for semantic search) and OCR'd text (for keyword search and LLM context).

Node 1: Retrieve

Vector search using cosine similarity on the native PDF embeddings. In production, this is a pgvector HNSW query. Here we use in-memory search to keep the example self-contained.

@traceable(name="embed_query", run_type="embedding")
def embed_query(query: str) -> list[float]:
    """Embed a text query using the asymmetric RETRIEVAL_QUERY task type."""
    result = gemini_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=query,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=EMBEDDING_DIMS,
        ),
    )
    return normalize(list(result.embeddings[0].values))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


@traceable(name="retrieve", run_type="retriever")
def retrieve_node(state: RAGState) -> dict:
    query = state.get("rewritten_query") or state["question"]
    query_embedding = embed_query(query)
    scored = []
    for doc in DOCUMENT_STORE:
        score = cosine_similarity(query_embedding, doc["embedding"])
        scored.append({**doc, "score": score, "embedding": None})  # drop the vector from state
    scored.sort(key=lambda x: x["score"], reverse=True)
    return {"documents": scored[:5]}


The retrieval here is identical to text-based RAG. The difference is what's in the vectors: PDF-native embeddings capture layout and visual structure, not only text semantics.

Node 2: Grade Relevance

Use with_structured_output for clean boolean grading:


from pydantic import BaseModel, Field


class RelevanceGrade(BaseModel):
    """Grade the relevance of retrieved documents to the user question."""

    relevant: bool = Field(description="Are the documents relevant to answering the question?")
    reasoning: str = Field(description="Brief explanation of the relevance assessment.")


grading_llm = llm.with_structured_output(RelevanceGrade)


@traceable(name="grade_relevance", run_type="chain")
def grade_node(state: RAGState) -> dict:
    docs_text = "\n\n".join(
        f"[Source: {d.get('source', 'unknown')}]\n{d['content'][:500]}"
        for d in state["documents"]
    )
    grade = grading_llm.invoke([
        {
            "role": "system",
            "content": (
                "You are a relevance grader. Given a user question and retrieved documents, "
                "determine if the documents contain information that can answer the question. "
                "Be strict: tangentially related but non-answering documents are not relevant."
            ),
        },
        {
            "role": "user",
            "content": f"Question: {state['question']}\n\nRetrieved documents:\n{docs_text}",
        },
    ])
    return {"relevance_score": 1.0 if grade.relevant else 0.0}


Node 3: Rewrite Query

When retrieval fails, rewrite the query. Same pattern as text-based RAG — the rewriter doesn't know or care that the embeddings came from PDF bytes:


@traceable(name="rewrite_query", run_type="chain")
def rewrite_node(state: RAGState) -> dict:
    response = llm.invoke([
        {
            "role": "system",
            "content": (
                "You are a query rewriter. The original query returned irrelevant documents. "
                "Rewrite the query to be more specific and targeted. "
                "Return ONLY the rewritten query, nothing else."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Original question: {state['question']}\n\n"
                "Rewrite the query to find more relevant documents:"
            ),
        },
    ])
    return {
        "rewritten_query": response.content,
        "rewrite_count": state["rewrite_count"] + 1,
    }


Node 4: Generate with Citations

The generator sees the OCR'd text — not the embeddings. Citations map to source filenames and page numbers:

@traceable(name="generate", run_type="chain")
def generate_node(state: RAGState) -> dict:
    docs_text = "\n\n".join(
        f"[{i + 1}] (Source: {d.get('source', 'unknown')}, Pages: {d.get('page_numbers', [])})\n"
        f"{d['content']}"
        for i, d in enumerate(state["documents"])
    )

    is_low_confidence = state.get("relevance_score", 0) < 1.0

    system_prompt = (
        "You are a document assistant. Answer the user's question using ONLY the provided "
        "documents. Cite sources using [1], [2], etc. If the documents don't contain enough "
        "information, say so — do not make up information."
    )
    if is_low_confidence:
        system_prompt += (
            "\n\nWARNING: Retrieved documents may not be fully relevant. "
            "Be especially careful. Prefix with 'Note: Based on limited context.'"
        )

    response = llm.invoke([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {state['question']}\n\nDocuments:\n{docs_text}"},
    ])

    citations = [
        {"index": i + 1, "source": d.get("source", "unknown"), "pages": d.get("page_numbers", [])}
        for i, d in enumerate(state["documents"])
    ]

    return {
        "answer": response.content,
        "citations": citations,
        "confidence": "low" if is_low_confidence else "high",
    }


Graph Assembly

The graph follows the corrective RAG pattern. The only difference from text-based RAG is what's in the documents: PDF-native embeddings and OCR'd text instead of extracted-and-chunked text.


from langgraph.graph import END, START, StateGraph
from langgraph.types import RetryPolicy


def route_after_grading(state: RAGState) -> str:
    if state.get("relevance_score", 0) >= 1.0:
        return "generate"
    if state["rewrite_count"] >= state["max_rewrites"]:
        return "generate"
    return "rewrite"


builder = StateGraph(RAGState)

builder.add_node("retrieve", retrieve_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("grade", grade_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("rewrite", rewrite_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("generate", generate_node, retry=RetryPolicy(max_attempts=3))

builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "grade")
builder.add_conditional_edges(
    "grade",
    route_after_grading,
    {"generate": "generate", "rewrite": "rewrite"},
)
builder.add_edge("rewrite", "retrieve")
builder.add_edge("generate", END)

graph = builder.compile()


Running It

if __name__ == "__main__":
    from langsmith import tracing_context

    # Ingest PDFs
    ingest_pdf(Path("docs/sample.pdf"))

    with tracing_context(
        metadata={"pipeline": "native-pdf-rag", "version": "v1"},
        tags=["production", "native-pdf-rag"],
    ):
        result = graph.invoke({
            "question": "What is this document about?",
            "rewritten_query": "",
            "documents": [],
            "relevance_score": 0.0,
            "rewrite_count": 0,
            "max_rewrites": 2,
            "answer": "",
            "citations": [],
            "confidence": "",
        })

    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Rewrites: {result['rewrite_count']}")


Production Failures

1. 6-page limit truncation. A 67-page legal brief gets split into 12 chunks. If the answer spans chunks 3 and 7, neither chunk alone answers the question, and the grader marks both as irrelevant. Fix: after retrieval, expand top-scoring chunks to include sibling chunks from the same document. With a 200K context window, you can fit 3-4 complete documents.

2. OCR hallucination on degraded scans. Vision-model OCR is probabilistic. On low-quality scans, the model might "smooth over" illegible text instead of marking it uncertain. For legal or financial documents, this is dangerous. Fix: use the best available OCR model (Gemini 2.5 Pro, not Flash). Instruct it to output [illegible] rather than guessing. Verify critical fields (dates, amounts, case numbers) against structured metadata.

3. Embedding space mismatch between PDF and text queries. Document embeddings come from raw PDF bytes. Query embeddings come from text strings. Gemini's RETRIEVAL_DOCUMENT / RETRIEVAL_QUERY task types handle this asymmetry, but edge cases exist where a text query doesn't match the visual embedding well. Fix: hybrid search. Combine vector search (PDF embeddings) with full-text search (OCR'd text via tsvector/BM25) using Reciprocal Rank Fusion.

4. Memory pressure from sub-PDF creation. pymupdf's insert_pdf creates in-memory copies. For a 200MB scanned PDF, splitting into 6-page groups temporarily holds the full document plus all sub-documents in memory. Fix: process the sub-PDFs as they're created, then discard. Don't hold all chunks in memory simultaneously.

5. Rate limiting on Gemini embedding API. Each 6-page group requires one embedding call. A 215-document corpus with an average of 8 pages per document generates ~350 embedding calls. At Gemini's default rate limits, this can take 30+ minutes. Fix: batch processing with exponential backoff. Use the Gemini Batch API at 50% of the default embedding price.
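The sibling-expansion fix from failure 1 can be sketched as a post-retrieval step (expand_with_siblings is a hypothetical helper over the DOCUMENT_STORE record shape used above):

```python
def expand_with_siblings(top_docs: list[dict], store: list[dict], max_docs: int = 20) -> list[dict]:
    """Pull in all chunks from the same source files as the top hits,
    ordered by first page number, so answers spanning chunks stay together."""
    sources = {d["source"] for d in top_docs}
    siblings = [d for d in store if d["source"] in sources]
    siblings.sort(key=lambda d: (d["source"], d["page_numbers"][0]))
    return siblings[:max_docs]
```

With a 200K-token context window, expanding a handful of top hits to their full parent documents is usually affordable.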
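The exponential-backoff fix from failure 5 can be sketched as a generic wrapper (the names are ours; the injectable sleep parameter exists only so the retry logic is testable without real delays):

```python
import random
import time


def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn with exponential backoff plus jitter; re-raise after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 0.25s of jitter to avoid thundering herds
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

Wrap each embedding call, e.g. with_backoff(lambda: embed_pdf_chunk(chunk["pdf_bytes"])), or switch to the Gemini Batch API for large backfills.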


Observability

Every function has @traceable with appropriate run_type. In LangSmith, a single query trace shows:

invoke_agent (chain)
├── embed_query (embedding)          ← Query → Gemini Embedding 2
├── retrieve (retriever)             ← Cosine similarity search
├── grade_relevance (chain)          ← Structured output grading
├── generate (chain)                 ← Claude generates answer

And an ingestion trace shows:

ingest_pdf (chain)
├── embed_pdf_chunk (embedding)      ← PDF bytes → Gemini Embedding 2
├── ocr_pdf_chunk (chain)            ← PDF bytes → Gemini 2.5 Pro
├── embed_pdf_chunk (embedding)      ← next chunk...
└── ocr_pdf_chunk (chain)


The embedding run type is critical — it lets LangSmith show you embedding latency separately from LLM latency, so you can diagnose whether retrieval or generation is the bottleneck.

Evals

The eval suite tests three axes: does the retriever find content (retrieval quality), does the answer use only that content (faithfulness), and does the answer address the question (relevance).

from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

ls_client = Client()
DATASET_NAME = "native-pdf-rag-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Native PDF RAG pipeline evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {"question": "What is this document about?"},
            {"question": "Who are the parties or people mentioned?"},
            {"question": "What dates are referenced in the document?"},
            {"question": "What is the main argument or conclusion?"},
        ],
        outputs=[
            {"expected_type": "summary"},
            {"expected_type": "entity_extraction"},
            {"expected_type": "date_extraction"},
            {"expected_type": "argument_analysis"},
        ],
    )


LLM-as-judge for faithfulness and relevance:

FAITHFULNESS_PROMPT = """\
Question: {inputs[question]}
Retrieved documents and generated answer: {outputs[answer]}

Rate 0.0-1.0 on faithfulness: Is every claim in the answer supported by the
retrieved documents? Unsupported inferences score 0.
Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

faithfulness_judge = create_llm_as_judge(
    prompt=FAITHFULNESS_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="faithfulness",
    continuous=True,
)


Custom deterministic evaluators:

def retrieval_has_content(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did retrieval return documents with actual OCR'd text?"""
    docs = outputs.get("documents", [])
    has_content = any(len(d.get("content", "")) > 50 for d in docs)
    return {"key": "retrieval_has_content", "score": 1.0 if has_content else 0.0}


def rewrite_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Fewer rewrites = better initial retrieval. PDF embeddings should need fewer."""
    count = outputs.get("rewrite_count", 0)
    score = 1.0 if count == 0 else (0.7 if count == 1 else 0.4)
    return {"key": "rewrite_efficiency", "score": score}


Run the suite with python evals.py. Check LangSmith for score distributions. If rewrite_efficiency averages below 0.7, your PDF embeddings aren't matching query semantics well — consider adding hybrid search with tsvector on the OCR'd text.
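If you do add hybrid search, the vector and full-text rankings need merging. A minimal Reciprocal Rank Fusion sketch (the fusion step only; the vector and tsvector/BM25 searches are assumed to each produce a ranked list of document IDs):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores the sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

k = 60 is the conventional damping constant; documents ranked highly by both searches float to the top without any score normalization.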

When to Use This

Use native PDF embedding when:

  • Documents are scanned images (no extractable text)
  • Layout carries meaning (legal filings, financial reports, engineering specs)
  • You want to eliminate text extraction as a failure point
  • You need fewer pipeline steps and API dependencies

Skip it when:

  • Documents are plain text (Markdown, .txt) with no visual structure
  • You need sub-page retrieval granularity (minimum chunk is 1 page)
  • You need to embed thousands of documents per second (PDF processing is slower than text embedding)
  • Your documents exceed 6 pages and need fine-grained cross-page retrieval

What This Replaces

| Before | After |
| --- | --- |
| Text extractor ($0.01/page) | pymupdf page splitting (free, local) |
| Text extraction → character chunking | Raw PDF bytes → Gemini Embedding 2 |
| OpenAI text-embedding-3-large | Gemini Embedding 2 (native PDF) |
| Tesseract / cloud OCR → image rendering | Gemini 2.5 Pro (raw PDF bytes, no rendering) |
| A chain of lossy transformation steps | One step: PDF bytes in, vector out |


The Bottom Line

The extract-then-embed pipeline was built around the limitations of text-only embedding models. Gemini Embedding 2 removes that limitation. If your documents have visual structure, and most real-world documents do, embedding the document directly produces better vectors than embedding extracted text. The pipeline collapses from a chain of lossy steps to a single one, with fewer dependencies, fewer failure modes, and better retrieval.

Technical References:

  • Stop Extracting Text from PDFs GitHub repo
  • Thinking in LangGraph
  • with_structured_output
  • GoogleGenerativeAIEmbeddings integration
