OCR Processing Pipeline (2025)

Built a modular OCR pipeline in Python that uses AWS Textract and the OpenAI GPT-4 API to digitize archival content. The pipeline handles Textract jobs asynchronously, converts input files automatically, applies GPT-based correction to the OCR output, extracts entities, and processes documents in batches with detailed logging and validation tests.
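As an illustration of the asynchronous Textract step, here is a minimal sketch and not the project's actual code. It assumes a Boto3 Textract client is passed in and that the document already sits in S3; the function name wait_for_textract_lines and its polling parameters are hypothetical. It starts a text-detection job, polls until the job finishes, and then follows NextToken pagination to collect the detected lines.

```python
import time


def wait_for_textract_lines(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an async Textract text-detection job and return its LINE texts.

    `client` is assumed to be a Boto3 Textract client, e.g. boto3.client("textract").
    The function name and the polling defaults are illustrative assumptions.
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we give up.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: keep requesting pages while a NextToken is returned.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Passing the client in as an argument keeps the function easy to exercise with a stub in validation tests, without calling AWS.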

Technologies: Python, AWS Textract, OpenAI GPT-4, S3, Boto3, ImageMagick, dotenv, Jupyter, Levenshtein Distance
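One plausible use of Levenshtein distance in the validation tests is scoring OCR output against a hand-checked reference. The sketch below is an assumption about that approach, not the project's code: it is written in plain Python instead of a Levenshtein library, and the helper names levenshtein and character_error_rate are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so the row buffers stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (hypothetical metric helper)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

A test could then assert that the error rate of the corrected text stays below a chosen threshold for each sample page.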
