IngestingPdfContent

This project contains the companion code for GPT-4o versus Azure Document Intelligence and Azure Computer Vision OCR

Notebook NameDescription
PdfToTextPages.ipynbBaseline, makes a text file for each page using pypdf to extract the text
PdfToPageImages.ipynbGiven a folder of PDF files, converts each page of each PDF file into JPEG images using a resolution of 300 dpi, and saves them in a structured directory format.
DocIntelligencePipeline.ipynbC# Polyglot notebook with functions to convert an entire PDF to markdown and another that creates an OCR markdown file from each image created using PdfToPageImages.ipynb
turbo-2024-04-09.ipynbAzure Open AI using GPT-4 with vision to create a markdown file for each image created using PdfToPageImages.ipynb
v4omni.ipynbOpenAI using GPT-4o to create a markdown file for each image created using PdfToPageImages.ipynb
v4omni-image-plus-docIntelOcr.ipynbOpenAI using GPT-4o to create a markdown file for each image created using PdfToPageImages.ipynb grounded with OCR text created using DocIntelligencePipeline.ipynb
visionWithOcr.ipynbAzure Computer Vision GPT4-Vision OCR
visionWithOcrAndGrounding.ipynbAzure Computer Vision GPT4-Vision OCR and grounding