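1. XX Docs document extraction: finished tool

    I recently got my own blog site up and running, and while I was at it I optimized the scripts and packaged them.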

    2025/06/30 EXE

  2. XX Docs document extraction (TXT)

    The TXT method is the same as the Word one; only the request path differs.

    ```python
    import re
    import os
    import time
    import json
    import random
    import requests
    import logging
    import glob
    import threading
    import queue
    import traceback
    import natsort
    import sys
    import shutil
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException, NoSuchElementException
    ```

    2025/06/29 python

  3. XX Docs document extraction (PDF)

    Example site. The PDF method simply loads the complete page and then captures the fully rendered images.

    ```python
    import os
    import time
    import base64
    import logging
    import re
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from PIL import Image
    ```

    2025/06/28 python
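Once the rendered page images from item 3 have been captured, they still have to be put back in page order. The sketch below is one possible way to do that, not the post's actual script: it sorts the saved images in natural order (so `page_10` comes after `page_2`) and optionally merges them into a single PDF with Pillow. The names `natural_key`, `order_pages`, and `images_to_pdf`, the `page_N.png` naming scheme, and the choice of a PDF as the output are all assumptions made for illustration.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks so numbers sort numerically."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def order_pages(file_names):
    """Return the captured page images in natural (human) order."""
    return sorted(file_names, key=natural_key)


def images_to_pdf(image_paths, pdf_path):
    """Merge already-ordered page images into one PDF (requires Pillow)."""
    from PIL import Image  # imported here so the sorting helpers work without Pillow

    pages = [Image.open(path).convert("RGB") for path in image_paths]
    if not pages:
        raise ValueError("no page images to merge")
    # save_all/append_images write every page into the same PDF file.
    pages[0].save(pdf_path, "PDF", save_all=True, append_images=pages[1:])


if __name__ == "__main__":
    # Hypothetical file names, as a capture loop might have written them.
    print(order_pages(["page_10.png", "page_2.png", "page_1.png"]))
```

A plain `sorted()` would put `page_10.png` before `page_2.png`, which is why the key splits out the digits first.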

  4. XX Docs document extraction (Word, plain text)

    Example site. Looking at the Network panel, none of the images there are the ones we need; the only related thing we find is a divider line. We check its request headers, take a fragment, and search for it, which turns up two JS files and a PNG image; the PNG is the divider line. We copy part of one JS file and parse it: wenku_1({"outline":null,"outlineMiss":null,"font":{"8a4d9d384b73f242336c5f440010001":"\u5b8b\u4f53","8a4d9d384b73f242336c5f440020001":"\u5b8b\u4f53","8a4d9d384b73f242336c5f440030001":"\u9ed1\u4f53","8a4d9d384b73f242336c5f440040001":"Arial","8a4d9d384b73f242336c5f440050001":"Arial Bold"},"style":[{"t":"style","c":[1,0],"s":{"font-size":"39.06"}},{"t":"style","c":[1],"s":{"font-family":"8a4d9d384b73f242336c5f440010001"}},{"t":"style","c":[0,1,6,7,10,17,18,2],"s":{"bold":"true"}},{"t":"style","c":[0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,3],"s":{"color":"#000000"}},{"t":"style","c":[0,1,9,4],"s":{"font-size":"39.06"}},{"t":"style","c":[1,6,15,5],"s":{"font-family":"8a4d9d384b73f242336c5f440010001"}},{"t":"style","c":[1,6],"s":{"font-family":"8a4d9d384b73f242336c5f440010001"}},{"t":"style","c":[10,7],"s":{"font-family":"8a4d9d384b73f242336c5f440020001"}},{"t":"style","c":[7,9,10,11,14,16,8],"s":{"font-family":"8a4d9d384b73f242336c5f440020001"}},{"t":"style","c":[9],"s":{"font-family":"8a4d9d384b73f242336c5f440020001"}},{"t":"style","c":[10],"s":{"font-size":"15.84"}},{"t":"style","c":[16,11],"s":{"font-size":"21.06"}},{"t":"style","c":[11,15,16,12],"s":{"font-size":"21.06"}},{"t":"style","c":[14,13],"s":{"font-size":"36"}},{"t":"style","c":[14],"s":{"font-size":"36"}},{"t":"style","c":[15],"s":{"font-size":"21.06"}},{"t":"style","c":[16],"s":{"letter-spacing":"0.09"}},{"t":"style","c":[18,17],"s":{"font-size":"23.94"}},{"t":"style","c":[18],"s":{"font-family":"8a4d9d384b73f242336c5f440050001"}}],"body":[{"c":"\u5357\u660c\u8f68\u9053\u4ea4\u901a\u5730\u94c1\u8fd0\u8425\u6709\u9650\u516c\u53f8","p":{"h":39.06,"w":547.933,"x":172.47,"y":123.437,"z":0},"ps":null,"s":{"letter-spacing":"0.084"},"t":"word","r":[1]},{"c":" 
","p":{"h":39.06,"w":19.529,"x":720.689,"y":123.437,"z":1},"ps":{"_enter":1},"s":{"bold":"true"},"t":"word","r":[0,7,9]},{"c":"\u6280\u672f\u6587\u4ef6","p":{"h":39.06,"w":156.552,"x":368.175,"y":193.637,"z":2},"ps":null,"s":{"letter-spacing":"0.104"},"t":"word","r":[1]},{"c":" ","p":{"h":15.84,"w":7.919,"x":524.805,"y":213.583,"z":3},"ps":{"_enter":1},"t":"word","r":[10]},{"c":" Q\/NGYY-B-SWGD-FB-03-2015 ","p":{"h":21.06,"w":588.269,"x":180.39,"y":261.299,"z":4},"ps":{"_enter":1},"s":{"letter-spacing":"-0.025"},"t":"word","r":[11]},{"c":"V1.0 ","p":{"h":21.06,"w":52.469,"x":716.189,"y":308.099,"z":5},"ps":{"_enter":1},"s":{"letter-spacing":"-0.044"},"t":"word","r":[11]},{"c":" ","p":{"h":39.06,"w":19.529,"x":446.475,"y":357.437,"z":6},"ps":null,"t":"word","r":[9]},{"c":" ","p":{"h":39.06,"w":19.529,"x":446.475,"y":427.682,"z":7},"ps":{"_enter":1},"t":"word","r":[9]},{"c":"\u63a5\u89e6\u7f51\u8bbe\u5907\u5e94\u6025\u62a2\u4fee\u9884\u6848\u5b9e\u65bd\u7ec6\u5219","p":{"h":39.06,"w":585.419,"x":153.929,"y":518.942,"z":8},"ps":null,"s":{"font-family":"8a4d9d384b73f242336c5f440030001","letter-spacing":"-0.034"},"t":"word","r":[4]},{"c":" ","p":{"h":39.06,"w":10.858,"x":739.23,"y":518.942,"z":9},"ps":{"_enter":1},"s":{"font-family":"8a4d9d384b73f242336c5f440040001"},"t":"word","r":[4]},{"c":"\uff08\u8bd5\u884c\u7a3f\uff09","p":{"h":36,"w":179.999,"x":356.475,"y":599.87,"z":10},"ps":null,"s":{"font-family":"8a4d9d384b73f242336c5f440010001"},"t":"word","r":[5,13]},{"c":" ","p":{"h":36,"w":18,"x":536.505,"y":599.87,"z":11},"ps":{"_enter":1},"t":"word","r":[14]},{"c":" ","p":{"h":36,"w":18,"x":446.475,"y":646.7,"z":12},"ps":{"_enter":1},"s":{"bold":"true"},"t":"word","r":[7,14]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":697.858,"z":13},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":732.958,"z":14},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":768.058,"z":15},"ps":null,"t":"word","r":[10]},{"c":" 
","p":{"h":15.839,"w":7.92,"x":446.475,"y":803.158,"z":16},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":838.258,"z":17},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":873.358,"z":18},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":908.458,"z":19},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":943.603,"z":20},"ps":null,"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.92,"x":446.475,"y":978.703,"z":21},"ps":{"_enter":1},"t":"word","r":[10]},{"c":" ","p":{"h":15.839,"w":7.919,"x":135.036,"y":1013.803,"z":22},"ps":{"_enter":1},"t":"word","r":[10]},{"c":"2015","p":{"h":21.059,"w":42.014,"x":168.149,"y":1052.159,"z":23},"ps":null,"s":{"letter-spacing":"-0.035"},"t":"word","r":[11]},{"c":"\u2014","p":{"h":21.059,"w":21.06,"x":210.27,"y":1052.159,"z":24},"ps":null,"t":"word","r":[15]},{"c":"05","p":{"h":21.059,"w":21.15,"x":231.149,"y":1052.159,"z":25},"ps":null,"t":"word","r":[16]},{"c":"\u2014","p":{"h":21.059,"w":21.06,"x":252.209,"y":1052.159,"z":26},"ps":null,"t":"word","r":[15]},{"c":"01","p":{"h":21.059,"w":20.97,"x":273.27,"y":1052.159,"z":27},"ps":null,"s":{"letter-spacing":"-0.089"},"t":"word","r":[11]},{"c":"\u53d1\u5e03","p":{"h":21.059,"w":42.12,"x":299.594,"y":1052.159,"z":28},"ps":null,"t":"word","r":[15]},{"c":" 
2015","p":{"h":21.059,"w":252.104,"x":341.534,"y":1052.159,"z":29},"ps":null,"s":{"letter-spacing":"-0.026"},"t":"word","r":[11]},{"c":"\u2014","p":{"h":21.059,"w":21.059,"x":593.745,"y":1052.159,"z":30},"ps":null,"t":"word","r":[15]},{"c":"06","p":{"h":21.059,"w":20.969,"x":614.625,"y":1052.159,"z":31},"ps":null,"s":{"letter-spacing":"-0.09"},"t":"word","r":[11]},{"c":"\u2014","p":{"h":21.059,"w":21.059,"x":635.685,"y":1052.159,"z":32},"ps":null,"t":"word","r":[15]},{"c":"01","p":{"h":21.059,"w":21.149,"x":656.564,"y":1052.159,"z":33},"ps":null,"t":"word","r":[16]},{"c":"\u5b9e\u65bd","p":{"h":21.059,"w":41.94,"x":683.069,"y":1052.159,"z":34},"ps":null,"s":{"letter-spacing":"-0.179"},"t":"word","r":[15]},{"c":" ","p":{"h":21.059,"w":10.529,"x":725.01,"y":1052.159,"z":35},"ps":{"_enter":1},"t":"word","r":[11]},{"c":"\u5357\u660c\u8f68\u9053\u4ea4\u901a\u5730\u94c1\u8fd0\u8425\u6709\u9650\u516c\u53f8","p":{"h":23.94,"w":508.206,"x":153.21,"y":1097.379,"z":36},"ps":{"_scaleX":1.522},"s":{"font-family":"8a4d9d384b73f242336c5f440010001","letter-spacing":"-0.156"},"t":"word","r":[6,17]},{"c":" ","p":{"h":23.94,"w":36.403,"x":661.469,"y":1097.379,"z":37},"ps":{"_scaleX":1.522},"s":{"font-family":"8a4d9d384b73f242336c5f440020001","letter-spacing":"-0.043"},"t":"word","r":[7,17]},{"c":"\u53d1\u5e03","p":{"h":21.059,"w":42.119,"x":697.65,"y":1099.853,"z":38},"ps":null,"s":{"bold":"true"},"t":"word","r":[6,15]},{"c":" ","p":{"h":23.94,"w":6.655,"x":739.95,"y":1097.379,"z":39},"ps":null,"t":"word","r":[18]},{"c":{"ix":0,"iy":0,"iw":636,"ih":1},"p":{"h":1,"w":636,"x":136,"y":287.437,"z":40},"ps":{"_vector":1},"s":null,"t":"pic"},{"c":{"ix":0,"iy":6,"iw":636,"ih":2},"p":{"h":2,"w":636,"x":136,"y":1082.437,"z":41},"ps":{"_vector":1},"s":null,"t":"pic"},{"c":" 
","p":{"h":23.94,"w":6.655,"x":751.829,"y":1097.379,"z":42},"ps":{"_enter":1},"t":"word","r":[18]}],"page":{"ph":1262.879,"pw":892.979,"iw":636,"ih":8,"v":6,"t":"1","pptlike":false,"cx":135.036,"cy":123.437,"cw":636.964,"ch":997.882}})

    2025/06/27 python
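The payload shown in item 4 is JSON wrapped in a callback (`wenku_1(...)`), with the visible text stored in the `c` field of `body` entries whose `t` is `"word"`. A minimal sketch of parsing it follows, under stated assumptions: `parse_jsonp` and `extract_text` are hypothetical helper names, and treating `"ps": {"_enter": 1}` as an end-of-line marker is a guess based on the sample above, not documented behavior.

```python
import json
import re


def parse_jsonp(text):
    """Strip a `callback(...)` wrapper and return the decoded JSON object."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.S)
    if match is None:
        raise ValueError("not a JSONP payload")
    # strict=False tolerates raw control characters (e.g. newlines) inside strings.
    return json.loads(match.group(1), strict=False)


def extract_text(page):
    """Concatenate the text of all "word" entries, in body order."""
    parts = []
    for item in page.get("body", []):
        if item.get("t") != "word":
            continue  # skip "pic" entries such as the divider lines
        parts.append(item["c"])
        # Assumption: ps._enter == 1 marks the end of a line.
        if (item.get("ps") or {}).get("_enter"):
            parts.append("\n")
    return "".join(parts)


if __name__ == "__main__":
    # A tiny hand-made payload with the same shape as the sample above.
    sample = (
        'wenku_1({"body":['
        '{"c":"\\u6280\\u672f","ps":null,"t":"word"},'
        '{"c":" ","ps":{"_enter":1},"t":"word"},'
        '{"c":{"ix":0,"iy":0},"ps":{"_vector":1},"t":"pic"},'
        '{"c":"V1.0","ps":null,"t":"word"}]})'
    )
    print(extract_text(parse_jsonp(sample)))
```

The `\uXXXX` escapes need no special handling, since `json.loads` decodes them to the actual characters.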

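  5. XX Docs document extraction (Word, editable)

    Editable Word format. This kind of document is a Word file that can be edited online, so the text stream can be extracted directly from the web page. Example site. Open the console and locate the content with the element picker: the text is written directly into the page. The page also contains images, so we need to both extract the text stream and insert the images into the document. One difficulty I could not solve is the size of the inserted images. I don't know how to control it, so in the output file you still have to adjust the image sizes yourself.

    ```python
    import os
    import re
    import time
    import random
    import requests
    import json
    import urllib.parse
    from io import BytesIO
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from docx import Document
    from docx.shared import Inches, Cm, Pt
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from bs4 import BeautifulSoup
    from PIL import Image
    import base64
    import io
    ```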

    2025/06/26 python
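On the image-size problem mentioned in item 5: one possible approach, sketched here rather than taken from the post, is to compute a display size that fits inside the page's text area and pass only a width to python-docx's `add_picture`, which then keeps the aspect ratio. The helper names `fit_size` and `add_scaled_picture`, the 96 DPI default, and the 9-inch height limit are assumptions chosen for illustration.

```python
def fit_size(px_w, px_h, max_w_in, max_h_in, dpi=96):
    """Return (width, height) in inches, shrunk to fit the box, never enlarged."""
    if px_w <= 0 or px_h <= 0:
        raise ValueError("image dimensions must be positive")
    nat_w, nat_h = px_w / dpi, px_h / dpi  # natural size at the assumed DPI
    scale = min(1.0, max_w_in / nat_w, max_h_in / nat_h)
    return nat_w * scale, nat_h * scale


def add_scaled_picture(doc, image_stream, px_w, px_h, max_h_in=9.0):
    """Insert an image no wider than the text area (requires python-docx)."""
    from docx.shared import Inches  # imported here so fit_size works without python-docx

    section = doc.sections[-1]
    # Usable width = page width minus both margins (lengths are stored in EMU).
    usable = section.page_width - section.left_margin - section.right_margin
    max_w_in = usable / 914400  # 914400 EMU per inch
    width_in, _ = fit_size(px_w, px_h, max_w_in, max_h_in)
    # Passing only width lets python-docx scale the height proportionally.
    doc.add_picture(image_stream, width=Inches(width_in))


if __name__ == "__main__":
    # A 960x480 px image limited to a 6 x 9 inch area.
    print(fit_size(960, 480, 6.0, 9.0))
```

The pixel dimensions can be read with Pillow (`Image.open(stream).size`) before the stream is handed to `add_scaled_picture`; remember to rewind the stream with `seek(0)` afterwards.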

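  6. XX Docs document extraction (PPT-PDF)

    Prerequisite: most documents require a membership to expand the full text, so you need a member account. It does not matter whether the account has any downloads left, as long as it can expand the document. You can look for a shared account; cheap memberships are easy to find anyway.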

    2025/06/25 python