Thanks, that’s exactly the problem space we had in mind. Embeddings help you find text, but they don’t give you entities, types, or consistency. Schema discovery is our attempt to bridge that gap between documents and actual data models.
Funny you mention invoices, inconsistent layout + embedded images/logos/signatures is precisely where plain text extraction breaks down. Multimodal + schema normalization together made a big difference.
Curious whether those teams ended up building rule-based systems or going fully ML/LLM?
Thanks, that’s exactly the problem space we had in mind. Embeddings help you find text, but they don’t give you entities, types, or consistency. Schema discovery is our attempt to bridge that gap between documents and actual data models.
Funny you mention invoices, inconsistent layout + embedded images/logos/signatures is precisely where plain text extraction breaks down. Multimodal + schema normalization together made a big difference.
Curious whether those teams ended up building rule-based systems or going fully ML/LLM?