Contents
- Preface
- Introduction to Tika
- File formats supported by Tika
- Maven dependencies
- Java program
- Java test program
- Test files
- Test results
- Files that failed to extract
- References
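Preface
These are my notes on extracting embedded files with Apache Tika. If you need to extract from a specific file type, check the "Files that failed to extract" section first so that you do not waste your time. If you spot a problem or know a fix, I would be grateful for your advice. Thanks in advance!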
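Introduction to Tika
Tika, in full Apache Tika, is a library for detecting file types and extracting content from files in many formats.
Tika relies on existing file parsers and document type detection techniques to detect and extract data.
With Tika you can easily extract the content of different kinds of files, such as spreadsheets, text files, images, PDFs and even multimedia formats, and to a certain extent recover structured text as well as metadata.
Unified parser interface: Tika wraps third-party parser libraries behind a single parser interface. Users are therefore freed from having to choose and call the right parser library for each file type they encounter.
The Tika facade class is the simplest and most direct way to call Tika from Java, and it follows the facade design pattern. The class is Tika in the org.apache.tika package of the Tika API.
Tika provides one common API for parsing different file formats. It draws on 83 existing specialized parser libraries, all of which are wrapped behind a single interface called Parser.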
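To make the "one interface, many parsers" idea concrete, here is a minimal sketch in plain Java. It is not Tika code: MiniParser, MiniFacade and the two registered types are hypothetical stand-ins, and detection is reduced to a file-extension check. Only the method names detect and parseToString are borrowed from the real org.apache.tika.Tika facade.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class MiniFacadeDemo {
    public static void main(String[] args) {
        MiniFacade facade = new MiniFacade();
        // The caller never names a concrete parser; the facade picks one.
        System.out.println(facade.detect("report.csv")); // prints text/csv
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(facade.parseToString("a.txt", data)); // prints hello
    }
}

// Hypothetical single interface that every wrapped parser implements.
interface MiniParser {
    String parse(byte[] data);
}

// Hypothetical facade: detects the type, then delegates to the matching parser.
class MiniFacade {
    private final Map<String, MiniParser> parsers = new LinkedHashMap<>();

    MiniFacade() {
        // Two toy parsers standing in for wrapped third-party libraries.
        parsers.put("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        parsers.put("text/csv", data -> new String(data, StandardCharsets.UTF_8).replace(',', '\t'));
    }

    // Toy detection by file extension only (an assumption made for brevity).
    String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".csv")) {
            return "text/csv";
        }
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    // Looks up the parser for the detected type and returns the extracted text.
    String parseToString(String fileName, byte[] data) {
        MiniParser parser = parsers.get(detect(fileName));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + fileName);
        }
        return parser.parse(data);
    }
}
```

The point of the design is that adding support for a new format means registering one more MiniParser; callers of detect and parseToString do not change.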
File formats supported by Tika
File format | Package and underlying library | Class in Tika |
---|---|---|
XML | org.apache.tika.parser.xml | XMLParser |
HTML | org.apache.tika.parser.html (uses the TagSoup library) | HtmlParser |
MS Office compound documents (OLE2 up to 2007, OOXML from 2007 onwards) | org.apache.tika.parser.microsoft and org.apache.tika.parser.microsoft.ooxml (use the Apache POI library) | OfficeParser (OLE2), OOXMLParser (OOXML) |
OpenDocument Format (OpenOffice) | org.apache.tika.parser.odf | OpenDocumentParser |
Portable Document Format (PDF) | org.apache.tika.parser.pdf (uses the Apache PDFBox library) | PDFParser |
Electronic Publication Format (digital books) | org.apache.tika.parser.epub | EpubParser |
Rich Text Format | org.apache.tika.parser.rtf | RTFParser |
Compression and packaging formats | org.apache.tika.parser.pkg (uses the Apache Commons Compress library) | PackageParser, CompressorParser and its subclasses |
Text format | org.apache.tika.parser.txt | TXTParser |
Feed and syndication formats | org.apache.tika.parser.feed | FeedParser |
Audio formats | org.apache.tika.parser.audio and org.apache.tika.parser.mp3 | AudioParser, MidiParser, Mp3Parser (for MP3) |
Image formats | org.apache.tika.parser.jpeg | JpegParser (for JPEG images) |
Video formats | org.apache.tika.parser.mp4 and org.apache.tika.parser.video (the latter uses a simple built-in algorithm to parse Flash video) | MP4Parser, FLVParser |
Java class files and JAR files | org.apache.tika.parser.asm | ClassParser, CompressorParser |
Mbox format (email messages) | org.apache.tika.parser.mbox | MboxParser |
CAD formats | org.apache.tika.parser.dwg | DWGParser |
Font formats | org.apache.tika.parser.font | TrueTypeParser |
Executable programs and libraries | org.apache.tika.parser.executable | ExecutableParser |
Maven dependencies
Version 2.8.0 is already available. If you are interested, give it a try and let me know how it goes. The configuration here uses 1.24.

    <repositories>
        <!-- Third-party repository kept from the original pom; the Tika artifacts themselves are available from Maven Central -->
        <repository>
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.24</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.24</version>
        </dependency>
    </dependencies>
Java program

    package com.xxx.xxx.carry;

    import org.apache.commons.io.FilenameUtils;
    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MimeTypeException;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class CarryingFileUtils {
        // AutoDetectParser automatically picks the most suitable parser for each input
        private static final Parser parser = new AutoDetectParser();
        private static final Detector detector = ((AutoDetectParser) parser).getDetector();
        private static final TikaConfig config = TikaConfig.getDefaultConfig();
        // Extensions of embedded files that are not worth saving (preview images)
        private static final Set<String> SKIPPED_EXTENSIONS = new HashSet<>(Arrays.asList("EMF", "WMF"));

        public static void extract(InputStream is, Path outputDir) throws SAXException, TikaException, IOException {
            Metadata m = new Metadata();
            // Holds the basic context information (at minimum, the parser in use)
            ParseContext c = new ParseContext();
            // -1 disables the write limit on the extracted body text
            BodyContentHandler h = new BodyContentHandler(-1);
            c.set(Parser.class, parser);
            EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(outputDir, c);
            c.set(EmbeddedDocumentExtractor.class, ex);
            // InputStream:    the input stream of the file to parse
            // ContentHandler: selects which part of the file to parse; BodyContentHandler
            //                 is the implementation dedicated to the document body
            // Metadata:       receives the metadata extracted while parsing
            // ParseContext:   carries context information; at minimum it must hold the parser in use
            parser.parse(is, h, m, c);
        }

        private static class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
            private final Path outputDir;
            private int fileCount = 0;

            private MyEmbeddedDocumentExtractor(Path outputDir, ParseContext context) {
                super(context);
                this.outputDir = outputDir;
            }

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml) throws IOException {
                // Try to get the name of the embedded file from the metadata
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null) {
                    name = "file_" + fileCount++;
                } else {
                    // Keep only the file name (not any directory paths that might be
                    // included in the name) and normalize it
                    name = name.replaceAll("\u0000", " ");
                    int prefix = FilenameUtils.getPrefixLength(name);
                    if (prefix > -1) {
                        name = name.substring(prefix);
                    }
                    name = FilenameUtils.normalize(FilenameUtils.getName(name));
                }
                // Now try to figure out the right extension for the embedded file
                MediaType contentType = detector.detect(stream, metadata);
                if (name.indexOf('.') == -1 && contentType != null) {
                    try {
                        name += config.getMimeRepository().forName(contentType.toString()).getExtension();
                    } catch (MimeTypeException e) {
                        e.printStackTrace();
                    }
                }
                // Fix the encoding of the embedded file name
                // (assumes GBK bytes that were decoded as ISO-8859-1)
                name = new String(name.getBytes("ISO-8859-1"), "GBK");
                Path outputFile = outputDir.resolve(name);
                // Avoid overwriting an existing file with the same name
                if (Files.exists(outputFile)) {
                    outputFile = outputDir.resolve(UUID.randomUUID().toString() + "-" + name);
                }
                Files.createDirectories(outputFile.getParent());
                String extension = name.substring(name.lastIndexOf(".") + 1).toUpperCase();
                // Skip irrelevant files; an exact set lookup avoids the false matches
                // that a substring test such as "EMF,WMF".contains(...) would produce
                if (!SKIPPED_EXTENSIONS.contains(extension)) {
                    Files.copy(stream, outputFile);
                }
            }
        }
    }
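The two file-name steps in the extractor (repairing the name's encoding and skipping EMF/WMF previews) can be tried in isolation with the standard library alone. The sketch below is an illustration, not part of the original utility: EmbeddedNameSketch and its method names are hypothetical, and the repair assumes the name really is GBK text that was decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

class EmbeddedNameSketch {
    private static final Charset GBK = Charset.forName("GBK");
    // Extensions to skip, matched exactly rather than by substring.
    private static final Set<String> SKIPPED = Set.of("EMF", "WMF");

    // Re-decodes a name whose GBK bytes were read as ISO-8859-1.
    // Names containing characters outside ISO-8859-1 cannot be such
    // mojibake, so they are returned unchanged.
    static String repairGbkName(String name) {
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(name)) {
            return name;
        }
        return new String(name.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    // Upper-cased extension without the dot, or "" when there is none.
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toUpperCase(Locale.ROOT);
    }

    static boolean shouldSkip(String name) {
        return SKIPPED.contains(extensionOf(name));
    }

    public static void main(String[] args) {
        // "\u6d4b\u8bd5" is a two-character Chinese word; escapes keep this file ASCII-only.
        String original = "\u6d4b\u8bd5.doc";
        String garbled = new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
        System.out.println(repairGbkName(garbled).equals(original)); // prints true
        System.out.println(shouldSkip("image1.emf"));                // prints true
        System.out.println(shouldSkip("notes.e"));                   // prints false
        System.out.println("EMF,WMF".contains("E"));                 // prints true
    }
}
```

The last line shows why a substring test is a fragile filter: a file with the extension "e" or "mf" would also be dropped. Note too that the encoding repair is a heuristic. A genuine Latin-1 name with accented characters would be damaged by it, so it only makes sense when the embedded names are known to come from a GBK environment.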
Java test program

    package com.xxx.xxx.utils;

    import com.xxx.xxx.carry.CarryingFileUtils;

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class Jkx {
        public static void main(String[] args) {
            // File to extract embedded files from
            String inputFilePath = "C:\\Users\\Administrator\\Desktop\\file_check\\qiantao\\Excel文件嵌套doc.xlsx";
            // Output directory for the extracted files
            String outFilePath = "C:\\Users\\Administrator\\Desktop\\file_check\\nest_file\\";
            // try-with-resources closes the stream even if extraction fails
            try (InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFilePath))) {
                Path outFileUrl = Paths.get(outFilePath);
                CarryingFileUtils.extract(inputStream, outFileUrl);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
Test files
Test files: Baidu Netdisk download link
Test results
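Files that failed to extract
The combinations that failed to extract are listed below. If anyone knows a fix, I would appreciate your advice:
File type | Embedded file type |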
---|---|
.dot | .doc |
.doc | .docm |
.doc | .wps |
.wps | .wps |
.xls | .xls |
.et | .et |
.xls | .et |
.xltm | .ett |
.pps | .ppt |
.html | .wps |
.mht | .wps |
.mhtml | .wps |
.pot | .pot |
.cebx | .* |
.dps | .dps |
.pptx | .dps |
.dpt | .dps |
.docx | .eid |
.doc | .eis |
.png | .odp |
.png | .ods |
.png | .odt |
References
- https://www.jianshu.com/p/407735f03094?v=1672195773961