The 5 tiers of data for AI
The last few posts were a bit more philosophical, so let's get into the nitty-gritty.
Say you have an idea for solving a problem related to processing documents, and from our AI Checklist, you found that AI should be a good fit. What's next? Next comes the data engineering part, data science's under-appreciated cousin.
The data readiness model
Your options and approach depend very much on the format your data is in.
Tier 5: Physical paper
True story: At a previous employer, a client wanted to explore whether quantum computing could help his business run more effectively. When asked about the existing process (to have a baseline against which to compare), he revealed that everything ran on paper. That is one hard lift! Before you can even begin to use AI, you need to have your data in digital form. Some services will digitize your paper documents for you, but if your business runs on paper and keeps generating more of it, the whole process needs to be digitized first with a modern workflow.
Tier 4: Scanned PDFs
If your documents are available in some scanned format, that's better than nothing, but unless the actual data we're interested in is an image, we need to get our hands on the text that's contained in the document. Optical Character Recognition (OCR) has been around for a while, so this isn't a dramatically hard lift, but it leads us to the next challenge.
Tier 3: Text-first PDFs
These PDFs are what you get if you export, say, a Word document to PDF, or apply OCR to a scanned PDF: The text you see on the screen is also represented, as text, in the document. That sounds good, but there's a catch: The PDF format only knows layout and formatting, with no concept of structure and semantics. This is mildly annoying when you want to extract the main body of text and pesky footers such as "Page 23" get mixed in, but it becomes a real pain when dealing with tables. A PDF doc will just say, "Okay, put some text here, put this other text here, put some vertical lines here and there."
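To make the table problem concrete, here is a minimal sketch of what rebuilding rows from positioned text can look like. The fragment list, the `rebuild_rows` function, and the `y_tolerance` parameter are all made up for illustration; a real PDF extractor will report positions in its own format, and real tables (merged cells, wrapped text, multi-line rows) need far more care than this.

```python
# Hypothetical fragments as a PDF extractor might report them: (x, y, text).
# PDF coordinates start at the bottom-left, so a larger y is higher on the page.
fragments = [
    (50, 700, "Item"), (200, 700, "Qty"), (300, 700, "Price"),
    (50, 680, "Widget"), (200, 680, "2"), (300, 680, "9.99"),
    (50, 660.4, "Gadget"), (200, 660, "1"), (300, 659.8, "24.50"),
]


def rebuild_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row left to right."""
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # close enough: same visual row
        else:
            rows.append((y, [(x, text)]))  # start a new row
    # Sort the cells of each row by x position and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


for row in rebuild_rows(fragments):
    print(row)
```

Note that the code has to guess which fragments belong together from coordinates alone. Nothing in the input says "this is a table" or "this is a header row", which is exactly the missing structure described above.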
Tier 2: Word documents (aka semi-structured)
Finally, we're dealing with structured text formats. You can ask for the document's paragraphs and not get a garbled mess. It might even include information such as "This here is a top-level heading" and "This here is a subheading", which may be relevant for your AI tool. I call them semi-structured because there's no clear separation between content semantics and content layout/styling, but at least we can readily extract the data that's there.
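As a rough illustration of "asking for the document's paragraphs", the sketch below reads a .docx file using only Python's standard library. It relies on the fact that a .docx file is a zip archive whose main content lives in word/document.xml. The function names `read_paragraphs` and `make_sample_docx` are hypothetical, the sample document is a stripped-down stand-in built just for the demo, and style IDs such as "Heading1" depend on the template and language of the document, so treat them as an assumption.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML, the format inside .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_paragraphs(docx_file):
    """Return (style_id, text) pairs for each non-empty paragraph."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # The paragraph style, if any, sits in <w:pPr><w:pStyle w:val="..."/>.
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is split across runs; join all <w:t> pieces.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text.strip():
            paragraphs.append((style_id, text))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx file (demo only)."""
    document_xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Introduction</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


for style_id, text in read_paragraphs(make_sample_docx()):
    print(style_id, "|", text)
```

Compared to the PDF case, there is no guessing from coordinates here: paragraphs and their styles are stated in the file. The "semi" in semi-structured shows up in the style ID, which mixes meaning (this is a heading) with presentation (this is how it looks).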
Tier 1: Fully structured content
The gold standard for any ML or AI application is a format that is semantics-first or otherwise highly structured:
Relational (SQL) databases
CSV files or Excel sheets
Structured document formats such as JSON or XML (as long as the keys or tags provide useful structuring information), possibly returned by an API
NoSQL databases (MongoDB, Cassandra, etc.), as long as the data is structured appropriately
Much of data engineering is about getting your data, no matter where and in what form it's coming from, to one of these formats so that any downstream application doesn't have to worry about the messiness of the lower tiers.
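Here is a small sketch of that last step, moving records from one structured format (JSON) into another (a relational table) so that downstream code can simply run queries. The invoice fields, the table layout, and the `load_invoices` function are invented for illustration; the sketch uses Python's built-in json and sqlite3 modules with an in-memory database.

```python
import json
import sqlite3

# Hypothetical records, e.g. as returned by an API. Note that "total" arrives
# as a string, a typical bit of messiness to clean up on the way in.
raw = """[
    {"invoice_id": "A-1", "customer": "Acme", "total": "120.50"},
    {"invoice_id": "A-2", "customer": "Globex", "total": "80"}
]"""


def load_invoices(json_text, conn):
    """Parse JSON invoice records and insert them into an 'invoices' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "invoice_id TEXT PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
    )
    rows = [
        (record["invoice_id"], record["customer"], float(record["total"]))
        for record in json.loads(json_text)
    ]
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_invoices(raw, conn)

# Downstream code no longer cares where the data came from.
total = conn.execute("SELECT SUM(total) FROM invoices").fetchone()[0]
print(loaded, "invoices loaded, total:", total)
```

The type conversion and the schema constraints are where the cleanup happens: a record with a missing customer or a non-numeric total fails loudly at load time instead of quietly confusing a model later.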
In a future post, we'll see how current AI models fare with the various tiers, and the tradeoffs involved in moving between the tiers versus using models that can deal with the lower tiers.