RAG Isn’t Enough: Structuring Documents…

Egor Tarasenko

Jan 6

An open-source approach to turning PDFs and DOCX files into structured datasets with LLMs

1 Comment

Comment removed

Thanks, that’s exactly the problem space we had in mind. Embeddings help you find text, but they don’t give you entities, types, or consistency. Schema discovery is our attempt to bridge that gap between documents and actual data models.

Funny you mention invoices: inconsistent layouts plus embedded images, logos, and signatures are precisely where plain text extraction breaks down. Combining multimodal input with schema normalization made a big difference.

Curious whether those teams ended up building rule-based systems or going fully ML/LLM?

Data & AI Engineering @ Ponder
