Amazon Textract
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms either manually or with simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
This sample demonstrates the use of Amazon Textract in combination with LangChain as a DocumentLoader.
Textract supports PDF, TIFF, PNG, and JPEG formats.
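Since only the formats listed above are supported, it can help to filter file names before sending anything to the service. The following is a minimal sketch, not part of the Textract or LangChain APIs: `SUPPORTED_SUFFIXES` and `has_supported_suffix` are hypothetical names, and the extension spellings are an assumption. A matching extension does not guarantee that Textract will accept the file.

```python
from pathlib import Path

# Extensions assumed to correspond to the listed formats (PDF, TIFF, PNG, JPEG).
# This set is illustrative and is not taken from the Textract API.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def has_supported_suffix(path: str) -> bool:
    """Return True if the file name ends in one of the assumed extensions."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


# Hypothetical file names used only for illustration.
candidates = ["invoice.PDF", "scan.tiff", "photo.jpeg", "notes.docx"]
accepted = [p for p in candidates if has_supported_suffix(p)]
print(accepted)
```

The check is case-insensitive and looks only at the file name, so it works as a cheap pre-filter rather than as validation of the file contents.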
Textract places limits on document sizes, languages, and characters; the Amazon Textract documentation lists the supported values.
%pip install --upgrade --quiet boto3 langchain-openai tiktoken python-dotenv
%pip install --upgrade --quiet "amazon-textract-caller>=0.2.0"