How to automate document parsing with OpenClaw
May 08, 2026 / Domantas P. / 22 min read
To automate document parsing with OpenClaw, configure its built-in PDF parser, choose the right extraction method for each document type, and validate the parsed output before using it in a workflow. This lets OpenClaw turn PDFs, invoices, contracts, scanned files, tables, and resumes into structured data such as JSON, CSV, Markdown, or searchable text.
A reliable OpenClaw document parsing setup follows five steps:
- Configure OpenClaw’s PDF parser and fallback models.
- Choose the right parsing method for each document type.
- Extract structured fields with a fixed schema.
- Use OCR for scanned or image-only PDFs.
- Validate the parsed data before saving or reusing it.
This guide explains what OpenClaw can parse out of the box, how to configure its PDF parser, when to use OCR or table extraction, how to extract structured fields from PDFs, and how to fix common document parsing errors. You’ll also learn when to use a managed setup like 1-Click OpenClaw and when a self-managed environment is better for local models, custom OCR, or private document processing.
What OpenClaw can parse out of the box
OpenClaw can parse text-based PDFs with its built-in PDF tool, so most digital invoices, contracts, reports, resumes, manuals, and exported documents can be processed without a separate document parsing skill. The PDF tool extracts text from one or more files and supports up to 10 PDFs per call.
Out of the box, OpenClaw works best with documents that contain selectable text, including:
- Business documents: invoices, receipts, contracts, proposals, and reports.
- Professional documents: resumes, manuals, white papers, and exported forms.
- Readable PDF layouts: statements, simple tables, and structured business files.
- Multi-document tasks: related PDFs that need to be compared or summarized together.
OpenClaw has two default parsing modes. In native PDF mode, it sends the full PDF directly to the selected model provider. This works best for regular PDFs with selectable text, but page selection is not supported in this mode. In fallback mode, OpenClaw extracts text from selected pages first and can render low-text pages as images when page-image rendering is available.
By default, OpenClaw processes up to 20 pages per PDF and accepts files up to 10 MB, based on the pdfMaxPages and pdfMaxBytesMb settings. You can adjust these limits in the OpenClaw configuration file if your workflow requires handling longer reports or larger files.
OpenClaw needs extra help when a file contains no readable text. Scanned PDFs, image-only documents, receipt photos, handwritten forms, and low-quality archives often require optical character recognition (OCR) for reliable extraction. If the selected model does not support image input and OpenClaw cannot extract text from the PDF, the PDF tool can fail instead of returning parsed content.
In practice, use the built-in PDF tool first for digital PDFs. Add OCR or a specialized extraction skill only when the document is scanned, table-heavy, image-based, or needs highly structured fields such as invoice line items, tax amounts, renewal dates, or candidate records.
How to configure OpenClaw’s PDF parser
To configure OpenClaw’s PDF parser, update the PDF model, fallback models, file-size limit, and page limit in your ~/.openclaw/openclaw.json file. These settings control how OpenClaw reads PDFs, when it switches to fallback parsing, and how much of each document it can process.
Open your OpenClaw configuration file:
~/.openclaw/openclaw.json
Then add or update the PDF parser settings under agents.defaults:
{
  "agents": {
    "defaults": {
      "pdfModel": {
        "primary": "claude-sonnet-4-5",
        "fallbacks": ["gemini-2.0-flash", "gpt-4o"]
      },
      "pdfMaxBytesMb": 25,
      "pdfMaxPages": 50
    }
  }
}
The pdfModel.primary setting defines the first model OpenClaw uses to parse a PDF. Use a model with strong PDF or vision support when the document layout matters, such as contracts, invoices, reports, resumes, forms, or manuals.
The pdfModel.fallbacks setting defines the backup models OpenClaw tries when the primary model fails, times out, or reaches a rate limit. Fallbacks make parsing more reliable because OpenClaw has another way to process the document, rather than stopping after a single failed model call.
The pdfMaxBytesMb setting controls the maximum PDF file size. Increase this value if you parse long contracts, annual reports, scanned archives, or bundled documents that exceed the default file limit. Keep the limit close to your actual document size, as larger PDFs take longer to process and use more model context.
The pdfMaxPages setting controls how many pages OpenClaw can parse from each PDF. Raise this value for long reports or contract bundles, but avoid sending unnecessary pages when the workflow only needs a specific part of the document, such as invoice totals, renewal clauses, financial tables, or signature pages.
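Before sending a file to the parser, it can help to check it against the limits you configured. The following is a minimal sketch, not part of OpenClaw: the function names are hypothetical, the fallback values are the defaults quoted earlier in this guide, and it assumes 1 MB means 1024 × 1024 bytes, which may differ from how OpenClaw counts.

```python
import json
from pathlib import Path

# Defaults quoted earlier in this guide; used when a key is absent from the config.
DEFAULT_MAX_BYTES_MB = 10
DEFAULT_MAX_PAGES = 20


def load_pdf_limits(config_path):
    """Read pdfMaxBytesMb and pdfMaxPages from agents.defaults in the config file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    defaults = config.get("agents", {}).get("defaults", {})
    return {
        "max_bytes_mb": defaults.get("pdfMaxBytesMb", DEFAULT_MAX_BYTES_MB),
        "max_pages": defaults.get("pdfMaxPages", DEFAULT_MAX_PAGES),
    }


def exceeds_size_limit(pdf_path, max_bytes_mb):
    """Return True when the file is larger than the size limit (assumes 1 MB = 1024 * 1024 bytes)."""
    size_bytes = Path(pdf_path).stat().st_size
    return size_bytes > max_bytes_mb * 1024 * 1024
```

A file that exceeds the limit can then be split or skipped before it reaches the parser, instead of failing partway through a batch.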
If you want OpenClaw to parse only specific pages, use a page range when fallback parsing is active:
{
  "file": "/workspace/contracts/vendor-agreement.pdf",
  "pages": "1-5,12,18-20"
}
Page ranges are useful for extracting targeted sections from long PDFs. For example, you can parse only the first five pages of a contract, the signature page, or the pages that contain financial tables. Native PDF mode may ignore page-range instructions because the full PDF is sent directly to the model provider. If exact page control matters, split the PDF first or use fallback parsing.
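If you split a PDF yourself or want to confirm which pages a range covers, you can expand the range string in a few lines of Python. This is an illustrative sketch: parse_page_range is a hypothetical helper, and it assumes 1-based, inclusive ranges. How OpenClaw itself interprets the string is not specified here.

```python
def parse_page_range(spec, max_page=None):
    """Expand a page-range string such as "1-5,12,18-20" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    # Optionally drop pages beyond the end of the document.
    if max_page is not None:
        pages = {page for page in pages if page <= max_page}
    return sorted(pages)


print(parse_page_range("1-5,12,18-20"))
# prints [1, 2, 3, 4, 5, 12, 18, 19, 20]
```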
For a managed setup, use 1-Click OpenClaw to run OpenClaw without manually configuring the server. This works well if you want the agent to stay online while it parses PDFs, extracts fields, and validates document data. Choose a self-managed VPS setup instead if your parsing workflow needs root access, local models, custom OCR binaries, or private infrastructure controls.
For safer parsing, keep source files inside the OpenClaw workspace and enable a workspace-only file policy if your setup supports it:
{
  "filePolicy": "workspace-only"
}
This setting helps prevent OpenClaw from reading files outside the approved workspace folder. It also makes parsing more predictable because every PDF, extracted text file, and structured output stays in a controlled location.
After saving the configuration, test the parser with three sample files: a text-based PDF, a table-heavy PDF, and a scanned PDF. The text-based PDF should confirm that normal extraction works. The table-heavy PDF should show whether you need a table extraction skill. The scanned PDF should confirm whether OCR or a vision-capable fallback model is required.
Which parsing method should you use for each document type?
Choose the OpenClaw parsing method based on the document’s format, text layer, layout complexity, and required output. Use the built-in PDF tool for readable PDFs, OCR for scanned files, table extraction for row-based data, and specialized extraction skills for documents that need strict field-level accuracy.
A simple decision rule works for most cases: use the built-in PDF tool when the text is readable, OCR when the document is image-based, table extraction when rows and columns matter, and a specialized schema when the output must follow a fixed structure.
For example, a digital vendor contract should start with the built-in PDF tool because OpenClaw can read the text and extract clauses directly. A scanned contract should run through OCR first because the page image needs to be converted to readable text before OpenClaw can extract parties, dates, and obligations. An invoice with line items should use structured extraction, as the final output requires reliable accounting fields rather than a general summary.
If a document fits more than one category, choose the method based on the output you need. A scanned invoice needs OCR first, then structured invoice extraction. A financial report with selectable text still needs table extraction if the final deliverable is a CSV. A resume with readable text still needs a fixed schema if the goal is a normalized hiring record.
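The decision rule can be written down as a small routing function. This is only a sketch of the rule described above: the DocumentProfile fields and the step names are hypothetical labels chosen for illustration, not OpenClaw identifiers.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    has_text_layer: bool      # False for scanned or image-only PDFs
    needs_table_rows: bool    # True when the deliverable is CSV or spreadsheet rows
    needs_fixed_schema: bool  # True when the output must follow a fixed field list


def choose_parsing_steps(profile):
    """Return the ordered parsing steps for a document, following the rule above."""
    # Image-based documents need OCR before anything else can read them.
    steps = ["builtin_pdf_tool" if profile.has_text_layer else "ocr"]
    if profile.needs_table_rows:
        steps.append("table_extraction")
    if profile.needs_fixed_schema:
        steps.append("structured_extraction")
    return steps
```

For a scanned invoice, `choose_parsing_steps(DocumentProfile(False, False, True))` returns OCR followed by structured extraction, which matches the order recommended above.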
How to extract structured fields from PDFs
To extract structured fields from PDFs with OpenClaw, give the agent a fixed schema that tells it which values to find, how to format them, and what to do when a field is missing. This converts a PDF from unstructured text into structured JSON, CSV, or Markdown.
Start with a clear extraction command. A vague prompt like “read this PDF” gives OpenClaw too much freedom, while a field-level instruction tells the parser exactly what to return.
Extract structured fields from this PDF.
Return the result as JSON only.
Use this schema:
{
  "document_type": "string",
  "title": "string",
  "sender_or_author": "string",
  "date": "YYYY-MM-DD or unknown",
  "summary": "string",
  "key_entities": ["string"],
  "important_dates": [
    {
      "date": "YYYY-MM-DD or unknown",
      "description": "string"
    }
  ],
  "amounts": [
    {
      "amount": "number",
      "currency": "string",
      "description": "string"
    }
  ],
  "missing_fields": ["string"]
}
If a value is not shown in the PDF, use "unknown" or add it to "missing_fields". Do not guess.
This schema works for general business documents because it captures the document type, source, date, summary, entities, dates, amounts, and missing fields. For better accuracy, adapt the schema to the document type before parsing.
For invoices, use fields that match accounting workflows:
Extract this invoice into structured JSON.
Return:
{
  "vendor_name": "string",
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD or unknown",
  "due_date": "YYYY-MM-DD or unknown",
  "subtotal": "number or unknown",
  "tax": "number or unknown",
  "total_amount": "number or unknown",
  "currency": "string or unknown",
  "payment_terms": "string or unknown",
  "line_items": [
    {
      "description": "string",
      "quantity": "number or unknown",
      "unit_price": "number or unknown",
      "line_total": "number or unknown"
    }
  ],
  "missing_fields": ["string"]
}
Do not calculate missing totals unless the PDF explicitly shows enough values to verify them.
For contracts, use a schema that extracts clauses, dates, parties, and obligations:
Extract structured contract fields from this PDF.
Return JSON only:
{
  "contract_type": "string",
  "parties": ["string"],
  "effective_date": "YYYY-MM-DD or unknown",
  "renewal_date": "YYYY-MM-DD or unknown",
  "termination_notice_period": "string or unknown",
  "payment_terms": "string or unknown",
  "confidentiality_clause": "string or unknown",
  "liability_clause": "string or unknown",
  "obligations": [
    {
      "party": "string",
      "obligation": "string",
      "deadline": "string or unknown"
    }
  ],
  "important_dates": [
    {
      "date": "YYYY-MM-DD or unknown",
      "event": "string"
    }
  ],
  "requires_human_review": "true or false",
  "missing_fields": ["string"]
}
If a clause is unclear, summarize what is visible and mark "requires_human_review" as true.
For resumes, use a hiring schema that normalizes candidate information:
Parse this resume into a structured candidate record.
Return JSON only:
{
  "candidate_name": "string",
  "email": "string or unknown",
  "phone": "string or unknown",
  "location": "string or unknown",
  "current_role": "string or unknown",
  "years_experience": "number or unknown",
  "skills": ["string"],
  "work_experience": [
    {
      "company": "string",
      "role": "string",
      "start_date": "YYYY-MM or unknown",
      "end_date": "YYYY-MM, present, or unknown"
    }
  ],
  "education": [
    {
      "degree": "string or unknown",
      "school": "string or unknown",
      "graduation_year": "number or unknown"
    }
  ],
  "links": {
    "linkedin": "string or unknown",
    "portfolio": "string or unknown",
    "github": "string or unknown"
  },
  "missing_fields": ["string"]
}
Do not infer missing contact details, employers, dates, or education history.
The schema should match the final use of the parsed data. Use JSON when another tool, database, or script will process the result. Use CSV when the output needs to become a spreadsheet row. Use Markdown when a human needs a readable review, such as a contract summary or document brief.
After OpenClaw returns the extracted fields, add a validation step before saving the result. The validation step checks whether the parser found the required values and whether the output is safe to use.
Validate the extracted fields before saving.
Check:
1. Required fields are present.
2. Dates use YYYY-MM-DD format.
3. Currency values include a currency code or symbol.
4. Invoice totals match visible subtotal, tax, and line items when possible.
5. Contract dates are linked to the correct clause.
6. The OCR or parsing output is not low-confidence, incomplete, or uncertain.
7. Missing fields are listed instead of guessed.
If validation fails, save the output as needs_review instead of final.
Structured extraction works best when each PDF type has its own schema. A general schema is sufficient for summaries and classification, but invoices, contracts, resumes, forms, and financial reports require separate field lists because each document type has distinct important entities, dates, amounts, and validation rules.
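Some of these checks are mechanical enough to run in code after the model returns its JSON, rather than relying on the prompt alone. The sketch below covers the general schema shown earlier in this section. The function names are hypothetical, and it only checks what can be verified without the source PDF: required fields, date formats, and currency on amounts.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(value):
    """Return True only for real calendar dates written as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_PATTERN.match(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_general_record(record, required_fields):
    """Return a list of problems found in a parsed record; an empty list means it passed."""
    problems = []
    for field in required_fields:
        if record.get(field) in (None, "", "unknown"):
            problems.append(f"missing required field: {field}")
    # Check the top-level date and every entry in important_dates.
    dates = [record.get("date")]
    dates += [entry.get("date") for entry in record.get("important_dates", [])]
    for value in dates:
        if value not in (None, "unknown") and not is_iso_date(value):
            problems.append(f"date is not YYYY-MM-DD: {value}")
    for amount in record.get("amounts", []):
        if not amount.get("currency"):
            problems.append(f"amount without currency: {amount.get('amount')}")
    return problems
```

A record with an empty problem list can be saved as final; anything else can be routed to needs_review together with the list of problems.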
How to extract tables from PDFs
To extract tables from PDFs with OpenClaw, use a table-focused parsing skill when the document contains rows, columns, repeated headers, merged cells, or financial line items. The built-in PDF tool can summarize table content, but a dedicated table extraction method is better when the final output needs to be a clean CSV or spreadsheet.
Use table extraction for documents where row-level accuracy matters, such as:
- invoices with line items
- financial statements
- price lists
- purchase orders
- bank statements
- shipping records
- inventory reports
- annual reports
- research tables
- payroll summaries
Start with a specific instruction that tells OpenClaw to preserve the table structure instead of summarizing it:
Extract every table from this PDF.
For each table:
1. Preserve the original column names when possible.
2. Keep the original row order.
3. Save each table as a separate CSV file.
4. Add the source file name and page number to each output.
5. Do not merge unrelated tables.
6. If a table is unclear, misaligned, or incomplete, mark it as needs_review instead of guessing.
For financial documents, add stricter rules because totals, subtotals, periods, and footnotes affect how the data is interpreted:
Extract the financial tables from this PDF into CSV format.
Preserve:
- reporting periods
- row labels
- column labels
- subtotals
- totals
- currency symbols
- percentage signs
- footnote markers
Do not summarize, recalculate, or rewrite the values.
Keep the visible numbers exactly as shown in the PDF.
For invoices, separate the invoice summary from the line-item table. This keeps document-level fields and row-level fields clean:
Extract this invoice into two outputs.
Create invoice-summary.csv with:
vendor_name, invoice_number, issue_date, due_date, subtotal, tax, total, currency
Create invoice-line-items.csv with:
item_description, quantity, unit_price, tax_rate, line_total
If a value is missing or unclear, leave the cell blank and add a note to extraction-notes.txt.
A table extraction skill, such as a pdfplumber-based pdf-extraction skill, is useful when OpenClaw needs more control over how a PDF table is detected. Table-heavy PDFs often contain visual separators, wrapped text, multi-line cells, repeated page headers, or columns without visible borders. A table parser can identify cell boundaries more precisely than a general summary prompt.
When extracting difficult tables, tune the parser for the document family. A bank statement, a vendor invoice, and an annual report may all contain tables, but their layouts are different. Use a dedicated extraction rule for each repeatable document type rather than relying on one generic prompt for every PDF.
For recurring PDFs from the same source, define a reusable table schema:
For all future invoices from this vendor, use this line-item schema:
item_description, sku, quantity, unit_price, tax_rate, discount, line_total
Save the output as CSV in:
/Documents/Invoices/CSV/[Vendor]/
If the column names in the PDF differ, map them to the closest matching schema field and list the mapping in extraction-notes.txt.
After extracting the table, validate the CSV before using it in reporting, accounting, or operations. PDF tables often look correct visually but are extracted incorrectly because of merged cells, hidden columns, repeated headers, or broken row alignment.
Use this validation instruction:
Validate the extracted CSV before marking it as final.
Check for:
1. Missing column headers.
2. Uneven row lengths.
3. Repeated page headers inside the table.
4. Blank required fields.
5. Values split across multiple columns.
6. Totals that do not match visible table totals.
7. Currency or percentage symbols that were dropped.
8. Footnotes that changed the meaning of a value.
If any issue appears, save the file and mark it as needs_review.
Use the built-in PDF tool when you only need a readable explanation of a table. Use a table extraction skill when you need spreadsheet-ready rows, stable column names, and values that can be reused in another system. This distinction keeps table parsing accurate because OpenClaw treats summaries and structured data extraction as different tasks.
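The structural checks from the CSV validation instruction above (missing headers, uneven rows, repeated page headers, blank required fields) can also be run as a script on the extracted file. This is a minimal sketch using Python's standard csv module; find_csv_problems is a hypothetical helper, and it does not cover checks that need the source PDF, such as comparing totals or footnotes.

```python
import csv
import io


def find_csv_problems(csv_text, required_columns=()):
    """Check extracted CSV text for common table-extraction defects."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if any(not name.strip() for name in header):
        problems.append("missing column header")
    # Row numbers count CSV records, with the header as row 1.
    for row_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} cells, found {len(row)}"
            )
        elif row == header:
            problems.append(f"row {row_number}: repeated page header")
        else:
            for name in required_columns:
                if name in header and not row[header.index(name)].strip():
                    problems.append(f"row {row_number}: blank required field {name}")
    return problems
```

If the returned list is not empty, the CSV can be saved with a needs_review marker and the list written to extraction-notes.txt.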
How to parse scanned PDFs with OCR
To parse scanned PDFs with OpenClaw, run optical character recognition (OCR) before extracting fields, summarizing the document, or saving structured output. Scanned PDFs store pages as images, so OpenClaw needs OCR to convert the visible text into machine-readable content first.
Use OCR when a PDF has:
- no selectable text
- photographed or scanned pages
- signed and scanned contracts
- receipt or invoice images
- old paper archives
- screenshots saved as PDFs
- forms with stamps, handwriting, or signatures
- multilingual printed documents
Start with a direct OCR instruction:
Run OCR on this scanned PDF before extracting any fields.
After OCR:
1. Extract the readable text from each page.
2. Detect the document type.
3. Extract the key fields based on the document type.
4. Save the OCR text separately from the structured output.
5. Mark unclear pages, low-confidence text, handwriting, stamps, or missing fields as needs_review.
For scanned invoices and receipts, combine OCR with a strict accounting schema. This prevents the parser from returning a loose summary when the workflow needs usable invoice data.
Run OCR on this scanned invoice or receipt.
Then extract:
{
  "vendor_name": "string or unknown",
  "invoice_number": "string or unknown",
  "issue_date": "YYYY-MM-DD or unknown",
  "due_date": "YYYY-MM-DD or unknown",
  "subtotal": "number or unknown",
  "tax": "number or unknown",
  "total_amount": "number or unknown",
  "currency": "string or unknown",
  "payment_method": "string or unknown",
  "line_items": [
    {
      "description": "string",
      "quantity": "number or unknown",
      "unit_price": "number or unknown",
      "line_total": "number or unknown"
    }
  ],
  "ocr_confidence": "high | medium | low",
  "missing_fields": ["string"],
  "requires_human_review": "true | false"
}
If vendor_name, issue_date, or total_amount is missing, mark requires_human_review as true.
For scanned contracts, ask OpenClaw to separate typed text from areas that often need manual review. OCR usually reads printed clauses better than handwritten edits, signatures, stamps, or marginal notes.
Run OCR on this scanned contract.
After OCR:
1. Extract the typed contract text.
2. Identify the parties, effective date, renewal date, payment terms, termination clause, and obligations.
3. Identify signature pages, stamps, handwritten edits, or unreadable sections.
4. Mark handwritten or unclear content as requires_human_review.
5. Return the structured contract fields as JSON.
For scanned archives, process pages in batches and keep the confidence status attached to each file. This helps separate reliable OCR output from documents that need human checking.
Run OCR on every scanned PDF in this folder.
For each file:
1. Extract text page by page.
2. Detect the document type.
3. Extract sender, date, title, key entities, and a 1-sentence summary.
4. Save the OCR text as a separate text file.
5. Save structured metadata as JSON.
6. Mark the file as high, medium, or low confidence.
7. Add low-confidence files to a review queue.
A useful confidence rule is:
Assign OCR confidence after parsing:
High confidence: The main text is readable, required fields are present, and page structure is clear.
Medium confidence: Most text is readable, but one or more fields are uncertain, incomplete, or affected by formatting.
Low confidence: The scan is blurry, skewed, handwritten, cut off, distorted, or missing required fields.
Only treat high-confidence OCR output as final. Mark medium- and low-confidence results as needs_review.
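If your workflow records review flags for each file, the same confidence rule can be applied consistently in code. This is a rough sketch: the three input lists are hypothetical and would come from whatever OCR or review step you use, not from a specific OpenClaw field.

```python
def assign_ocr_confidence(scan_defects, missing_required_fields, uncertain_fields):
    """Map review flags to high, medium, or low, following the confidence rule above."""
    # Blurry, skewed, or cut-off scans and missing required fields mean low confidence.
    if scan_defects or missing_required_fields:
        return "low"
    # Readable text with one or more uncertain fields means medium confidence.
    if uncertain_fields:
        return "medium"
    return "high"


def status_for_confidence(confidence):
    """Only high-confidence OCR output is treated as final."""
    return "final" if confidence == "high" else "needs_review"
```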
OCR output should not be treated as perfect text. Scanned documents often contain skewed pages, faded ink, watermarks, multi-column layouts, handwritten notes, stamps, and broken characters. These issues can change names, dates, totals, and clause wording, so OpenClaw should preserve uncertainty rather than fill gaps.
Run OCR first, extraction second, and validation third. OCR converts the scanned PDF to readable text, extraction turns the OCR text into structured fields, and validation checks whether the parsed data is complete enough to use. This sequence keeps scanned PDF parsing accurate because OpenClaw does not summarize or structure a document before it has reliable text to work with.
How to run private document parsing with local models
To run private document parsing with OpenClaw, connect OpenClaw to a local model provider such as Ollama and keep the source files inside an approved workspace folder. This setup lets OpenClaw parse documents without sending their contents to external model APIs, which is useful for contracts, client records, internal reports, HR files, and other sensitive documents.
Local parsing works best for text-based PDFs. OpenClaw can extract PDF text locally and send it to the local model for summarization, classification, or structured field extraction. Scanned PDFs need OCR first because a text-only local model cannot read image-based pages without a separate vision or OCR step.
First, install and run a local model provider on the same server or machine where OpenClaw runs. For example, with Ollama, you can pull a text model for regular document parsing:
ollama pull llama3.1
Then configure OpenClaw to use the local model as the primary PDF model:
{
  "agents": {
    "defaults": {
      "pdfModel": {
        "primary": "ollama:llama3.1"
      },
      "pdfMaxBytesMb": 25,
      "pdfMaxPages": 50
    }
  }
}
In this setup, OpenClaw uses a local model for PDF parsing rather than a hosted model provider. Since most local text models do not support native PDF input, OpenClaw should use fallback parsing: it extracts readable PDF text first, then sends that text to the local model.
For stronger privacy, restrict the folders that OpenClaw can read and write:
{
  "filePolicy": "workspace-only"
}
Then keep private documents inside a dedicated workspace folder, such as:
/Documents/OpenClaw/
  /Inbox/
  /Parsed/
  /Review/
This prevents the agent from reading unrelated files outside the approved workspace. It also makes document parsing easier to audit because every source file, extracted text file, and structured output is kept in a single controlled location.
Use this prompt when parsing a private text-based PDF:
Parse this PDF using the local model only.
Extract:
- document type
- title
- sender or author
- date
- key entities
- important dates
- amounts
- 5-bullet summary
- missing fields
Return the result as JSON.
Do not send the document text to external APIs.
If a field is not visible in the PDF, return "unknown" instead of guessing.
For private contracts, use a stricter schema:
Parse this contract using the local model only.
Return JSON with:
{
  "contract_type": "string",
  "parties": ["string"],
  "effective_date": "YYYY-MM-DD or unknown",
  "renewal_date": "YYYY-MM-DD or unknown",
  "payment_terms": "string or unknown",
  "termination_clause": "string or unknown",
  "confidentiality_clause": "string or unknown",
  "liability_clause": "string or unknown",
  "obligations": [
    {
      "party": "string",
      "obligation": "string",
      "deadline": "string or unknown"
    }
  ],
  "requires_human_review": "true | false",
  "missing_fields": ["string"]
}
Mark requires_human_review as true if a clause is unclear, missing, contradictory, or incomplete.
For scanned private documents, add OCR before local model parsing. A practical private pipeline is:
- Run local OCR on the scanned PDF.
- Save the OCR text inside the workspace.
- Send only the OCR text to the local model.
- Extract structured fields from the OCR text.
- Mark low-confidence pages for review.
Use this instruction:
Run local OCR on this scanned PDF before parsing.
After OCR:
1. Save the OCR text in the workspace.
2. Parse the OCR text with the local model only.
3. Extract the required fields as JSON.
4. Mark blurry, handwritten, incomplete, or low-confidence pages as needs_review.
5. Do not send the document or OCR text to external APIs.
Local models are better for privacy, but they usually need more validation than hosted document models. They may struggle with long PDFs, complex tables, small print, handwritten notes, or visual layouts. For that reason, keep a review rule in every private parsing workflow:
Before saving the parsed result as final, check:
1. Required fields are present.
2. Dates use a consistent format.
3. Amounts include currency.
4. Extracted clauses match visible document text.
5. OCR confidence is high if the document was scanned.
6. No required field was inferred from context alone.
If validation fails, save the result as needs_review.
Use local document parsing when privacy matters more than convenience. Use hosted PDF or OCR models when the document layout is complex, the scan quality is poor, or the workflow needs stronger vision understanding. A good private setup keeps the document local, extracts only the fields the workflow needs, and routes uncertain results to human review instead of guessing.
How to validate parsed document data
To validate parsed document data in OpenClaw, check whether the extracted fields are complete, correctly formatted, consistent with the source document, and safe to use without human review. Validation should happen after extraction and before the parsed result is saved, sent to another app, or used in a business process.
A parsing workflow should never treat extracted data as final just because OpenClaw returned JSON, CSV, or Markdown output. PDFs can contain broken text layers, scanned pages, merged table cells, unclear dates, handwritten notes, or conflicting values. Validation catches these issues before incorrect data enters a spreadsheet, contract tracker, hiring database, or archive.
Start with a required-field check. Each document type should have its own list of fields that must be present before the output is marked as final.
Validate the parsed output before saving it as final.
Required fields:
- document_type
- source_file
- date
- sender_or_author
- summary
If any required field is missing, empty, unclear, or marked as unknown, save the result as needs_review.
Do not guess missing values.
For invoices and receipts, validate the fields that affect accounting accuracy:
Validate this parsed invoice before adding it to the invoice log.
Check:
1. vendor_name is present.
2. invoice_number is present if shown.
3. issue_date and due_date use YYYY-MM-DD format.
4. subtotal, tax, and total_amount are numeric.
5. currency is present.
6. line_items are complete if the invoice has a visible item table.
7. total_amount matches subtotal + tax when those values are visible.
8. no required amount was inferred from context alone.
If any check fails, save the invoice to /Documents/_review/ and explain which fields need checking.
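The arithmetic in check 7 is easy to verify outside the model. The sketch below uses Python's decimal module to avoid floating-point rounding. The helper names and the 0.01 tolerance are assumptions for illustration, and the comparison is skipped whenever a value is missing or marked unknown, so nothing is inferred.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Convert a parsed amount to Decimal, or return None if it is not a finite number."""
    if value is None or isinstance(value, bool):
        return None
    try:
        amount = Decimal(str(value))
    except InvalidOperation:
        return None
    return amount if amount.is_finite() else None


def check_invoice_totals(invoice, tolerance=Decimal("0.01")):
    """Return mismatches between extracted totals; an empty list means none were found."""
    problems = []
    subtotal = to_decimal(invoice.get("subtotal"))
    tax = to_decimal(invoice.get("tax"))
    total = to_decimal(invoice.get("total_amount"))
    # Only compare when all three values were extracted as numbers.
    if subtotal is not None and tax is not None and total is not None:
        if abs(subtotal + tax - total) > tolerance:
            problems.append(
                f"total_amount {total} does not equal subtotal {subtotal} + tax {tax}"
            )
    line_totals = [
        to_decimal(item.get("line_total")) for item in invoice.get("line_items", [])
    ]
    # Compare line items to the subtotal only when every line total is numeric.
    if subtotal is not None and line_totals and all(v is not None for v in line_totals):
        if abs(sum(line_totals) - subtotal) > tolerance:
            problems.append(
                f"line totals sum to {sum(line_totals)}, but subtotal is {subtotal}"
            )
    return problems
```

Invoices with discounts, shipping, or several tax lines will not satisfy this simple formula, so treat a mismatch as a reason for review rather than proof of an extraction error.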
For contracts, validation should focus on date accuracy, clause matching, and uncertainty. A contract parser should not make a final judgment when a clause is missing, ambiguous, or partially unreadable.
Validate this parsed contract before saving the review as final.
Check:
1. parties are extracted from the contract text.
2. effective_date is linked to the correct clause.
3. renewal_date is linked to the correct renewal or term clause.
4. termination_notice_period is present if the contract contains a termination section.
5. payment_terms match the visible payment clause.
6. obligations are assigned to the correct party.
7. high-risk or unclear clauses are marked as requires_human_review.
8. no legal conclusion is stated without visible supporting text.
If any clause is unclear, save the output as needs_review.
For resumes, validation should protect against false inference and inconsistent candidate records:
Validate this parsed resume before adding it to the candidate tracker.
Check:
1. candidate_name is present.
2. email or another contact method is present if shown.
3. work_experience entries include company, role, and dates when available.
4. skills are extracted from the resume, not inferred from job titles alone.
5. education is listed only if shown.
6. protected characteristics are not extracted or used in screening notes.
7. missing fields are listed in missing_fields.
If name, contact details, or work experience is missing, mark the record as needs_recruiter_review.
For table extraction, validate the structure before the CSV is used. This prevents broken PDF tables from creating inaccurate rows in the spreadsheet.
Validate the extracted table before marking the CSV as final.
Check:
1. Column headers are present.
2. Row lengths are consistent.
3. Repeated page headers are removed.
4. Required columns are not blank.
5. Values were not split across unrelated columns.
6. Currency symbols, percentages, and footnote markers are preserved.
7. Visible totals match extracted totals when totals are present.
If the table is misaligned, incomplete, or inconsistent, save the CSV as needs_review.
For OCR-based parsing, validation should check confidence and readability before structured fields are trusted:
Validate the OCR output before extracting final fields.
Check:
1. OCR confidence is high for pages with required fields.
2. The document is not blurry, skewed, cut off, or incomplete.
3. Handwritten text, stamps, and signatures are marked separately.
4. Key dates, names, totals, and clauses are readable.
5. Low-confidence text is not used as a final value.
If OCR confidence is medium or low, extract what is readable and mark the output as needs_review.
A good validation rule should return one of three statuses:
- final — all required fields are present, formatted correctly, and supported by visible document text
- needs_review — one or more fields are missing, unclear, low-confidence, inconsistent, or business-critical
- failed — the document could not be parsed, the file is unreadable, or the output is unusable
Use this final validation wrapper for most parsing workflows:
After parsing, assign one validation status:
final:
Use this only when all required fields are present, formats are correct, and the values are supported by the document.
needs_review:
Use this when fields are missing, unclear, inconsistent, low-confidence, or require human approval.
failed:
Use this when the document cannot be read, OCR fails, the file is corrupted, or the output does not match the requested schema.
Return:
{
  "validation_status": "final | needs_review | failed",
  "validation_errors": ["string"],
  "missing_fields": ["string"],
  "review_reason": "string or null"
}
Validation keeps document parsing reliable by separating extracted data from approved data. OpenClaw can read and structure the document, but the workflow should only save final results when the output is complete, formatted consistently, and supported by the source file.
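If a script sits between the parser and your storage, it can assign the status itself instead of trusting the model's own label. The following sketch builds the wrapper object described above. The function name and its inputs are hypothetical, and it assumes a parsed value of None means the document could not be read at all.

```python
def build_validation_result(parsed, validation_errors, missing_fields):
    """Assign final, needs_review, or failed and return the validation wrapper."""
    if parsed is None:
        # Nothing usable came back: unreadable file, failed OCR, or invalid output.
        status = "failed"
        reason = "document could not be parsed"
    elif validation_errors or missing_fields:
        status = "needs_review"
        reason = "; ".join(
            list(validation_errors) + [f"missing: {name}" for name in missing_fields]
        )
    else:
        status = "final"
        reason = None
    return {
        "validation_status": status,
        "validation_errors": list(validation_errors),
        "missing_fields": list(missing_fields),
        "review_reason": reason,
    }
```

Because the status is computed from the error lists, a record cannot be marked final while it still carries open validation errors.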
Common OpenClaw document parsing errors
Common OpenClaw document parsing errors usually happen because the file path is blocked, the PDF has no readable text, the selected model cannot process the file, or the parsed output does not match the requested schema. Most errors can be fixed by checking the file location, parsing mode, OCR step, page limits, and validation rules.
Why does OpenClaw return unsupported_pdf_reference?
OpenClaw returns unsupported_pdf_reference when the PDF path is unsupported, outside the agent’s workspace, or otherwise not allowed. This often happens when you pass an HTTP URL, a file from an unapproved folder, or a path that does not exist inside the OpenClaw workspace.
Fix it by moving the PDF into the approved workspace folder and referencing the local file path:
/workspace/documents/invoice-2026-04.pdf
If you use a workspace-only file policy, keep all source PDFs in the allowed folder. Do not point the parser to system folders, private home directories, or external URLs unless your setup explicitly allows those sources.
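A pre-flight check can catch most of these references before they reach the parser. This is a minimal sketch that assumes the workspace root is /workspace, as in the example path above. The function name is hypothetical, and the check only inspects the path string; it does not confirm that the file exists.

```python
from pathlib import PurePosixPath

# Assumed workspace root, taken from the example path in this guide.
WORKSPACE = PurePosixPath("/workspace")


def is_allowed_pdf_reference(ref: str,
                             workspace: PurePosixPath = WORKSPACE) -> bool:
    """Return True if ref looks like a local PDF path inside the workspace."""
    # Reject URLs such as https://... or file://...
    if "://" in ref:
        return False
    path = PurePosixPath(ref)
    # Require an absolute path with no parent-directory hops.
    if not path.is_absolute() or ".." in path.parts:
        return False
    if path.suffix.lower() != ".pdf":
        return False
    # The workspace must be one of the path's ancestors.
    return workspace in path.parents
```

Running this check before each parsing call lets the workflow report a clear path error instead of waiting for unsupported_pdf_reference.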
Why does OpenClaw ignore the page range?
OpenClaw may ignore the page range when the parser uses native PDF mode. In native PDF mode, the whole PDF can be sent directly to the model provider, so page arguments such as pages: "1-5,12" may not control which pages are read.
Fix it by using fallback parsing or splitting the PDF before parsing. For example, create a smaller PDF that contains only the target pages, then send that file to OpenClaw.
Use page ranges when you are extracting specific sections, such as:
- invoice totals
- contract renewal clauses
- signature pages
- appendix tables
- financial statements
- specific resume sections
If exact page control matters, test the parser with a small PDF first and confirm that only the requested pages are included in the output.
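One way to run that test is to expand the page argument into explicit page numbers and compare them with the pages that actually appear in the output. The sketch below assumes the "1-5,12" syntax shown above. Both helper names are hypothetical, and how you detect which pages appear in the output depends on your own workflow.

```python
from typing import Iterable, List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "1-5,12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start < 1 or end < start:
                raise ValueError("invalid page range: " + part)
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError("invalid page number: " + part)
            pages.add(page)
    return sorted(pages)


def unexpected_pages(requested: Iterable[int], seen: Iterable[int]) -> List[int]:
    """Return pages found in the output that were never requested."""
    return sorted(set(seen) - set(requested))
```

If unexpected_pages returns anything, the parser probably read the whole file, and splitting the PDF is the safer option.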
Why does a scanned PDF return empty text?
A scanned PDF returns empty text when the file contains page images instead of a selectable text layer. OpenClaw cannot extract normal PDF text from an image-only scan unless the workflow runs OCR or uses a vision-capable model.
Fix it by adding an OCR step before field extraction:
Run OCR on this scanned PDF first. Then extract the required fields from the OCR text. If OCR confidence is medium or low, mark the output as needs_review.
For best results, route scanned invoices, contracts, receipts, and archive documents through OCR before asking for JSON, CSV, or summaries.
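If the workflow has per-page extracted text available, a simple length check can decide which pages to send to OCR. This is a rough heuristic sketch: the function name and the 20-character threshold are assumptions, and the input is assumed to be a list holding one text string per page.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Pages flagged this way are likely image-only scans, so they are good candidates for the OCR step before any field extraction.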
Why does the parser miss invoice fields?
OpenClaw may miss invoice fields when the prompt asks for a general summary instead of a fixed invoice schema. Invoice data is easier to extract when the parser knows which fields matter.
Fix it by using a required field list:
Extract vendor_name, invoice_number, issue_date, due_date, subtotal, tax, total_amount, currency, and line_items. If any field is missing, return "unknown" and list it in missing_fields.
Also validate the total amount against the visible subtotal, tax, and line items when possible. If the totals do not match, save the invoice as needs_review.
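The totals check is plain arithmetic, so it is easy to run outside the model. The sketch below uses the field names from the prompt above; the "amount" key on each line item and the 0.01 tolerance are assumptions. Amounts are expected as plain numeric strings without currency symbols.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Dict, List


def total_mismatches(invoice: Dict[str, Any],
                     tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return a list of problems found in the invoice totals."""
    try:
        subtotal = Decimal(str(invoice["subtotal"]))
        tax = Decimal(str(invoice["tax"]))
        total = Decimal(str(invoice["total_amount"]))
        # "amount" is an assumed key for each line item.
        items = [Decimal(str(item["amount"]))
                 for item in invoice.get("line_items", [])]
    except (KeyError, InvalidOperation):
        # Missing or non-numeric values such as "unknown" cannot be checked.
        return ["amounts are missing or not numeric"]

    problems: List[str] = []
    if abs(subtotal + tax - total) > tolerance:
        problems.append("subtotal + tax does not equal total_amount")
    if items and abs(sum(items, Decimal("0")) - subtotal) > tolerance:
        problems.append("line items do not add up to subtotal")
    return problems
```

An empty list means the totals are consistent; any entry is a reason to save the invoice as needs_review.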
Why are PDF tables extracted incorrectly?
PDF tables are extracted incorrectly when the document contains merged cells, repeated headers, wrapped text, hidden borders, multi-page rows, or columns that are positioned visually rather than structurally. A general PDF parser may understand the table but still fail to return clean spreadsheet rows.
Fix it by using a table extraction skill and adding validation rules:
Extract each table as a separate CSV. Preserve row order, column headers, currency symbols, percentages, and footnote markers. If row lengths are uneven or totals do not match, mark the CSV as needs_review.
For recurring documents from the same vendor or report source, create a reusable schema so OpenClaw maps similar tables to the same column names each time.
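The uneven-row rule from the prompt can be checked with Python's standard csv module once the CSV text is returned. This is a minimal sketch with a hypothetical function name; it treats the first row as the header and compares every other row against its width.

```python
import csv
import io
from typing import List


def uneven_rows(csv_text: str) -> List[int]:
    """Return 1-based row numbers whose column count differs from the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    width = len(rows[0])
    return [
        number
        for number, row in enumerate(rows[1:], start=2)
        if len(row) != width
    ]
```

Any row number returned here is a sign that a merged cell or wrapped line shifted the columns, so the CSV should be marked as needs_review.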
Why does the JSON output break or include extra text?
JSON output breaks when the prompt allows explanations, Markdown formatting, comments, or mixed response formats. This can make the result unusable for scripts, databases, and spreadsheets.
Fix it by asking for JSON only and defining the schema clearly:
Return JSON only. Do not include Markdown, comments, explanations, or text outside the JSON object.
Add a validation step after parsing. If the response does not match the schema, ask OpenClaw to repair the JSON before saving it.
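The validation step can be as small as a strict json.loads call plus a key check. The sketch below is one possible shape: the function name and the returned tuple are assumptions, and it deliberately rejects any response that contains text outside the JSON object.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def parse_json_only(response: str,
                    required_keys: List[str]
                    ) -> Tuple[Optional[Dict[str, Any]], List[str]]:
    """Parse a response that must be a single JSON object with required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        # Markdown fences or explanations around the JSON end up here.
        return None, ["invalid JSON: " + exc.msg]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    missing = [key for key in required_keys if key not in data]
    if missing:
        return None, ["missing keys: " + ", ".join(missing)]
    return data, []
```

When the error list is not empty, pass it back to OpenClaw with the repair request instead of saving the output.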
Why are dates parsed in the wrong format?
Dates are parsed incorrectly when the PDF uses ambiguous formats such as 04/05/2026, where the day and month depend on the country. This commonly affects invoices, receipts, contracts, and application forms.
Fix it by defining the expected date format in the prompt:
Use YYYY-MM-DD for all dates. Assume dates are DD-MM-YYYY unless the document clearly uses another format. If the date is ambiguous, return the visible date and mark it as needs_review.
For contracts and invoices, link each parsed date to its meaning, such as issue date, due date, renewal date, termination deadline, or effective date.
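The ambiguity rule in the prompt can be mirrored in a post-processing step. This sketch handles only numeric day/month/year strings separated by "/", "." or "-"; the function name is hypothetical. It returns the visible date unchanged, with a review flag, whenever day and month could be swapped.

```python
import re
from datetime import date
from typing import Tuple


def normalize_date(raw: str) -> Tuple[str, bool]:
    """Return (value, needs_review) for a date written as two numbers and a year."""
    text = raw.strip()
    match = re.fullmatch(r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})", text)
    if not match:
        return text, True  # unrecognized format: keep the visible date
    first, second, year = int(match[1]), int(match[2]), int(match[3])

    # Both numbers could be a month, so the order cannot be decided.
    if first <= 12 and second <= 12 and first != second:
        return text, True

    # Whichever number is above 12 must be the day.
    day, month = (second, first) if second > 12 else (first, second)
    try:
        return date(year, month, day).isoformat(), False
    except ValueError:
        return text, True  # for example 31/02/2026 or 13/13/2026
```

With this rule, 25/04/2026 becomes 2026-04-25, while 04/05/2026 is left as written and flagged for review.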
Why does OpenClaw hallucinate missing fields?
OpenClaw may hallucinate missing fields when the prompt asks for a complete record but does not say what to do with unavailable values. This is risky for invoices, contracts, resumes, tax records, and compliance documents.
Fix it by explicitly forbidding inference:
Do not infer, estimate, or guess missing values. If a value is not visible in the document, return "unknown" and add the field name to missing_fields.
Use a review status when required fields are missing. This keeps incomplete parsed data from being treated as final.
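A simple guard against invented values is to check that each extracted value appears in the document text. The sketch below is a literal substring check with a hypothetical function name. It only works for values copied verbatim from the document, so normalized dates or reformatted amounts would need their own comparison.

```python
from typing import Dict, List


def unsupported_values(record: Dict[str, str], source_text: str) -> List[str]:
    """Return field names whose values do not appear in the source text."""
    # Collapse whitespace and ignore case so line breaks do not cause misses.
    haystack = " ".join(source_text.split()).lower()
    flagged = []
    for field, value in record.items():
        if value == "unknown":
            continue  # already reported through missing_fields
        needle = " ".join(str(value).split()).lower()
        if needle not in haystack:
            flagged.append(field)
    return flagged
```

Fields returned by this check were not found in the source text as written, which makes them candidates for needs_review rather than final.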
Why does local parsing fail on scanned or visual PDFs?
Local parsing often fails on scanned or visual PDFs because many local models are text-only. They can process extracted text, but they cannot reliably read page images, handwriting, stamps, or visual table layouts without OCR or a vision model.
Fix it by running local OCR first:
Run local OCR on the scanned PDF. Save the OCR text in the workspace. Then parse the OCR text with the local model only.
Use local parsing for privacy-sensitive text-based PDFs. Add OCR for scanned private files, and route low-confidence OCR output to human review.
Why does OpenClaw process too much of the document?
OpenClaw processes too much of the document when the file is long, the parser reads the full PDF, or the prompt does not identify the target section. This can increase processing time and make the output less focused.
Fix it by narrowing the request:
Extract only the renewal clause, termination clause, payment terms, and important dates from this contract. Ignore unrelated sections unless they affect those fields.
For long reports, split the PDF or parse only the needed sections, such as the executive summary, financial tables, risk section, or appendix.
Why is the parsed output not safe to use automatically?
Parsed output is not safe to use automatically when required fields are missing, OCR confidence is low, totals do not match, clauses are unclear, or the document contains sensitive or high-risk information. Parsing creates structured data, but validation determines whether it is ready for use.
Fix it by assigning a validation status:
Return one validation status:
- final
- needs_review
- failed
Use final only when all required fields are present and supported by visible document text. Use needs_review when data is missing, unclear, inconsistent, or low-confidence. Use failed when the document cannot be parsed.
This error-handling rule keeps OpenClaw useful without letting uncertain data move into spreadsheets, contract trackers, HR systems, or compliance workflows unchecked.
Key takeaways for OpenClaw document parsing
OpenClaw document parsing works best when each document type has the right parser, extraction schema, validation rule, and review path. Use the built-in PDF tool for readable PDFs, optical character recognition (OCR) for scanned files, table extraction for row-based data, and structured schemas when the output needs to become JSON, CSV, or another reusable format.
A reliable OpenClaw parsing process follows this order:
- Read the document with the built-in PDF tool, OCR, table extraction, or a local parser.
- Extract the required fields with a document-specific schema.
- Validate the output for missing fields, incorrect formats, low OCR confidence, mismatched totals, unclear clauses, or broken tables.
- Mark uncertain results as needs_review instead of filling gaps.
- Save only validated data to spreadsheets, databases, archives, or downstream workflows.
Start with the simplest parsing setup first. A text-based PDF with a fixed JSON schema is usually easiest to validate because the document already contains readable text. After that works reliably, add more complex formats, such as invoices, table-heavy reports, scanned documents, contracts, and private local parsing.
Use a managed OpenClaw solution if you want document parsing without manual server configuration. This setup keeps OpenClaw online while it parses PDFs, extracts fields, and validates outputs. Use a self-managed VPS setup instead if you need root access for local models, custom OCR binaries, private infrastructure controls, or specialized document-processing tools.
The most important rule is to separate parsed data from approved data. OpenClaw can extract text, tables, dates, clauses, amounts, and entities from documents, but the workflow should treat the results as final only when the values are visible in the source file and pass validation.
Once OpenClaw produces reliable JSON or CSV outputs, those outputs can become the foundation for AI document workflows with OpenClaw, such as inbox filing, contract review queues, invoice logs, searchable archives, alerts, or app syncing.