How to Extract Text from Any PDF and Image for Large Language Models

Extracting text from PDFs and images for use with language models

Use these text extraction techniques to get high-quality data for your LLMs

Motivation

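Large language models have taken the internet by storm, and as a result many people pay too little attention to the most important part of using them: quality data!

This article presents a few techniques for extracting text efficiently from any type of document. By the end of this tutorial, you will have a clear idea of which tool to use for your use case.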

Python Libraries

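This article focuses on the Pytesseract, easyOCR, PyPDF2, and LangChain libraries. The experimental data is a one-page PDF file, freely available on my GitHub.

Both Pytesseract and easyOCR work on images, so the PDF file must be converted to images before its content can be extracted.

The conversion can be done with pypdfium2, a powerful library for processing PDF files. Install it as follows: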

pip install pypdfium2

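The function below takes the path to a PDF as input and returns its pages as a list, with one entry per page that maps the page index to that page's JPEG image bytes. The default scale of 300/72 renders each page at 300 DPI, because PDF coordinates use 72 points per inch.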

import pypdfium2 as pdfium
from io import BytesIO

def convert_pdf_to_images(file_path, scale=300/72):
    # Pages are rendered one at a time; this assumes pypdfium2 v4 or later.
    pdf_file = pdfium.PdfDocument(file_path)
    final_images = []

    for i in range(len(pdf_file)):
        # Render page i to a bitmap, then convert it to a PIL image.
        image = pdf_file[i].render(scale=scale).to_pil()

        # Encode the page as JPEG bytes in memory.
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)

        # One single-entry dict per page: {page index: JPEG bytes}.
        final_images.append({i: image_byte_array.getvalue()})

    return final_images
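Since convert_pdf_to_images returns a list of single-entry dictionaries, it can help to see how that structure is unpacked. The following is only a minimal sketch: save_page_images and the page_001.jpg naming scheme are hypothetical and not part of the original tutorial. It assumes each dictionary maps a zero-based page index to JPEG bytes, and it writes one file per page so you can inspect the rendered pages before running OCR on them.

```python
from pathlib import Path


def save_page_images(pages, out_dir):
    """Write each page's JPEG bytes to out_dir and return the written paths.

    `pages` is assumed to have the shape returned by convert_pdf_to_images:
    a list of single-entry dicts mapping page index -> JPEG bytes.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for page in pages:
        for index, jpeg_bytes in page.items():
            # Page indices are zero-based, so add 1 for human-friendly names.
            target = out_path / f"page_{index + 1:03d}.jpg"
            target.write_bytes(jpeg_bytes)
            written.append(target)
    return written
```

With a PDF on disk, a call might look like save_page_images(convert_pdf_to_images("sample.pdf"), "pages"), where both the file name and the output folder are placeholders.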