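Introduction:
Extracting text from a large number of PDF documents can be tedious work. This article presents a small tool written in Python, with a wxPython GUI and PyMuPDF (`fitz`) for the extraction, that pulls the text out of a PDF with one click and saves each page as its own text file, which can save a lot of manual effort.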
import wx
import pathlib
import fitz  # PyMuPDF


class PDFExtractor(wx.Frame):
    def __init__(self, parent, title):
        super(PDFExtractor, self).__init__(parent, title=title, size=(400, 200))
        panel = wx.Panel(self)
        vbox = wx.BoxSizer(wx.VERTICAL)

        # Pickers for the input PDF and the output folder
        self.file_picker = wx.FilePickerCtrl(panel, style=wx.FLP_DEFAULT_STYLE | wx.FLP_USE_TEXTCTRL)
        self.save_picker = wx.DirPickerCtrl(panel, style=wx.DIRP_DEFAULT_STYLE | wx.DIRP_USE_TEXTCTRL)
        self.extract_button = wx.Button(panel, label="Extract")
        self.extract_button.Bind(wx.EVT_BUTTON, self.on_extract)

        vbox.Add(wx.StaticText(panel, label="Select PDF file:"), 0, wx.ALL | wx.EXPAND, 5)
        vbox.Add(self.file_picker, 0, wx.ALL | wx.EXPAND, 5)
        vbox.Add(wx.StaticText(panel, label="Select output folder:"), 0, wx.ALL | wx.EXPAND, 5)
        vbox.Add(self.save_picker, 0, wx.ALL | wx.EXPAND, 5)
        vbox.Add(self.extract_button, 0, wx.ALL | wx.CENTER, 5)
        panel.SetSizer(vbox)

    def on_extract(self, event):
        pdf_path = self.file_picker.GetPath()
        save_path = self.save_picker.GetPath()
        if pdf_path and save_path:
            progress_dialog = wx.ProgressDialog("Extraction progress", "Extracting...", maximum=100, parent=self)
            try:
                with fitz.open(pdf_path) as doc:
                    total_pages = len(doc)
                    for index, page in enumerate(doc):
                        # Write the text of each page to its own file
                        text = page.get_text()
                        output_file = pathlib.Path(save_path) / f"page_{index + 1}.txt"
                        output_file.write_text(text, encoding="utf-8")
                        # Convert the page count into a percentage for the dialog
                        progress = int((index + 1) / total_pages * 100)
                        progress_dialog.Update(progress, f"Extracting page {index + 1} of {total_pages}")
                progress_dialog.Update(100, "Extraction complete!")
                wx.MessageBox("Extraction complete!", "Success", wx.OK | wx.ICON_INFORMATION)
            except Exception as e:
                wx.MessageBox(str(e), "Error", wx.OK | wx.ICON_ERROR)
            finally:
                # Always close the dialog, even if extraction failed
                progress_dialog.Destroy()
        else:
            wx.MessageBox("Please select a PDF file and an output folder!", "Error", wx.OK | wx.ICON_ERROR)


def main():
    app = wx.App()
    frame = PDFExtractor(None, "PDF Extractor")
    frame.Show()
    app.MainLoop()


if __name__ == '__main__':
    main()
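In this example, we create a `wx.ProgressDialog` object to display the extraction progress. While extracting the text of each page, we use the `enumerate` function to get the index of the current page and compute the progress as a percentage of the total page count. We then call `progress_dialog.Update` to update both the progress bar and the message it displays.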
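To make the percentage calculation concrete, here is a minimal sketch that isolates the expression from the listing in a small helper. The name `page_progress` and the 3-page document are assumptions for illustration only; they are not part of the original tool.

```python
def page_progress(index: int, total_pages: int) -> int:
    """Return the percentage complete after finishing the page at 0-based `index`."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    # Same expression as in the listing: pages done / total pages, as a percentage
    return int((index + 1) / total_pages * 100)


# Walk through a hypothetical 3-page document.
for index in range(3):
    print(f"page {index + 1} of 3 -> {page_progress(index, 3)}%")
```

Because `int()` truncates a floating-point result, the value can occasionally land one percent low. If that matters, the integer form `(index + 1) * 100 // total_pages` avoids floats entirely.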
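Note that extraction can take a while, so we use a progress dialog both to show progress and to keep the user from interacting with the window in the meantime. When extraction finishes, the dialog is closed: the `finally` block calls `progress_dialog.Destroy()`, so it is also dismissed if an error occurs.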
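The cleanup in the listing relies on `try` / `except` / `finally`: whatever happens during extraction, the dialog gets destroyed. The sketch below shows that pattern on its own. `FakeDialog`, `run_with_dialog`, and `failing_task` are hypothetical stand-ins so the example runs without a GUI; only the `Destroy()` method name mirrors the real `wx.ProgressDialog`.

```python
class FakeDialog:
    """Hypothetical stand-in for wx.ProgressDialog so the sketch runs without a GUI."""

    def __init__(self):
        self.destroyed = False

    def Destroy(self):
        self.destroyed = True


def run_with_dialog(task):
    """Run `task` and return (dialog, error); the dialog is always destroyed."""
    dialog = FakeDialog()
    error = None
    try:
        task()
    except Exception as exc:  # report the failure instead of crashing
        error = str(exc)
    finally:
        dialog.Destroy()  # runs on success and on failure
    return dialog, error


def failing_task():
    raise ValueError("cannot open file")


ok_dialog, ok_error = run_with_dialog(lambda: None)
bad_dialog, bad_error = run_with_dialog(failing_task)
print(ok_dialog.destroyed, ok_error)    # True None
print(bad_dialog.destroyed, bad_error)  # True cannot open file
```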
The key parts are:
1) File selection:
    self.file_picker = wx.FilePickerCtrl(panel, style=wx.FLP_DEFAULT_STYLE | wx.FLP_USE_TEXTCTRL)
2) Folder selection:
    self.save_picker = wx.DirPickerCtrl(panel, style=wx.DIRP_DEFAULT_STYLE | wx.DIRP_USE_TEXTCTRL)
3) Progress display:
    # inside the page loop
    progress = int((index + 1) / total_pages * 100)
    progress_dialog.Update(progress, f"Extracting page {index + 1} of {total_pages}")
    # after the loop
    progress_dialog.Update(100, "Extraction complete!")
4) Most important, getting the text out of the PDF:
    with fitz.open(pdf_path) as doc:
        total_pages = len(doc)
        for index, page in enumerate(doc):
            text = page.get_text()
            output_file = pathlib.Path(save_path) / f"page_{index + 1}.txt"
            output_file.write_text(text, encoding="utf-8")
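The file-writing half of step 4 can be tried without a PDF at hand. The following is a sketch under stated assumptions: `save_page_texts` is a hypothetical helper, and the sample strings stand in for what `page.get_text()` would return for each page. The naming scheme `page_<n>.txt` and the UTF-8 encoding are taken from the listing.

```python
import pathlib
import tempfile


def save_page_texts(page_texts, save_path):
    """Write each page's text to page_<n>.txt under save_path; return the paths."""
    out_dir = pathlib.Path(save_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(page_texts):
        output_file = out_dir / f"page_{index + 1}.txt"
        output_file.write_text(text, encoding="utf-8")
        written.append(output_file)
    return written


# Stand-in strings; with PyMuPDF these would come from
# (page.get_text() for page in doc) on an opened document.
sample_pages = ["first page", "second page"]
with tempfile.TemporaryDirectory() as tmp:
    files = save_page_texts(sample_pages, tmp)
    names = [f.name for f in files]
    contents = [f.read_text(encoding="utf-8") for f in files]
print(names)  # ['page_1.txt', 'page_2.txt']
```

One thing to keep in mind: since the file names depend only on the page number, extracting a second PDF into the same folder overwrites the files from the first. Writing into a subfolder per PDF (for example, one named after the file stem) would be one way around that.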
The result is as follows:
[Figure: screenshot of the result]