AI approaches for the data tiers
In the previous post, we looked at the five tiers of data for AI, and now we'll see an interesting tradeoff:
Data in the top tier (structured data) works with just about any ML or AI model.
Data in the fourth tier (just one step above physical paper) will only work with fancy "multi-modal" (that is, text and vision) models.
The decision we have to make is:
Do we use complex data transformations to move our data to a higher tier so that we can use a simpler ML model?
Or do we keep our transformations simple so that we can throw a more complex model at it?
Modern multi-modal models (GPT-4o, for example, or Gemini 2.0) can take in a PDF, even with images, and extract a surprising amount of information. Still, the results are less predictable, and come with fewer guarantees, than those of a more purpose-built extraction pipeline that lets you use a specialized ML model (or no ML model at all, which is great for cost and reproducibility).
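To make that route concrete, here is a minimal sketch of pushing a low-tier PDF straight into a multi-modal model. It assumes the openai and pdf2image Python packages (the latter needs poppler installed) and an OPENAI_API_KEY in the environment; the file name and prompt are placeholders.

```python
# Minimal sketch: feed a low-tier (scanned) PDF straight into a multi-modal model.
# Assumes `openai` and `pdf2image` are installed; "invoice.pdf" and the prompt
# are placeholders for your own data and extraction task.
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

# Render each PDF page to an image so the vision model can read it.
pages = convert_from_path("invoice.pdf", dpi=200)

image_parts = []
for page in pages:
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    image_parts.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    })

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "Extract the vendor, date, and total as JSON."}]
        + image_parts,
    }],
)
print(response.choices[0].message.content)
```

A handful of lines gets you end to end, but nothing here guarantees the model finds every field on every page.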
Some guidelines around making this choice:
Do a bit of prototyping and experimentation directly on the low-tier, raw data. How far can you push GPT, Gemini, etc.?
How crucial is it that nothing gets missed? Are we looking for the general gist, or do we need to look for information line by line? The latter might require getting down and dirty with the data.
How big is the lift from the current tier to the desired one? Going from a Word doc to a fully structured format is more manageable than going from a scanned PDF to a semi-structured document (the sketch below makes this concrete).
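To illustrate why the Word-doc lift is smaller, here is a small sketch using the python-docx package (an assumption, as is the "report.docx" file name) showing how much structure a Word document already hands you for free:

```python
# Sketch: a Word doc is already close to structured data.
# Assumes the `python-docx` package; "report.docx" is a placeholder.
from docx import Document

doc = Document("report.docx")

# Paragraphs and headings come out as clean text, with style names attached.
for para in doc.paragraphs:
    if para.text.strip():
        print(para.style.name, "|", para.text)

# Tables are already rows and cells -- no OCR or layout analysis needed.
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
```

A scanned PDF gives you none of this for free; you start from pixels and have to earn every table and heading.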
In the spirit of growing a complex system from small seeds, the easiest way to get a functioning end-to-end system off the ground might indeed be to feed your low-tier data into a top-shelf AI model. Then, once you can observe it in production, you can gradually add transformation steps that improve your data quality.
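One way to picture that evolution is a pipeline that starts as "raw data in, big model out" and grows transformation steps over time. The sketch below is purely illustrative: the step names and the call_model placeholder are assumptions, not a prescribed design.

```python
# Sketch of a pipeline that starts as "raw data -> big model" and grows
# transformation steps over time. All names here are illustrative.

def call_model(payload: str) -> str:
    """Placeholder for a call to a multi-modal or text model."""
    raise NotImplementedError

# Day 1: no transformations -- the model sees the raw, low-tier data.
transformations = []

# Later, as you watch it run in production, promote the data tier by tier, e.g.:
# transformations = [ocr_pdf, split_into_sections, extract_tables]

def run_pipeline(raw_data: str) -> str:
    data = raw_data
    for step in transformations:
        data = step(data)      # each step lifts the data to a higher tier
    return call_model(data)    # the model's job shrinks as the steps grow
```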
If you're currently wrestling with your data pipeline and your systems, we'd love to hear from you! What have you tried, and what has and hasn't worked?