个性化阅读
专注于IT技术分析

Tika MS Office文件提取示例

本文概述

为了提取诸如xls文件之类的Microsoft Office文件, Tika提供了OOXMLParser类。此类用于从Microsoft文件提取内容和元数据。它位于org.apache.tika.parser.microsoft.ooxml包中, 并包含下表中列出的各种构造函数和方法。

Tika OOXMLParser构造函数

Constructor Description
public OOXMLParser() 它用于实例化类。

Method Description
公共Set <MediaType> getSupportedTypes(ParseContext上下文) 它返回此解析器支持的媒体类型集。
公共无效解析(InputStream流, ContentHandler处理程序, 元数据元数据, ParseContext上下文)引发IOException, SAXException, TikaException 它将文档流解析为一系列XHTML SAX事件。

OOXMLParser示例

package tikaexample;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MSOfficeExample {
	
	public static void main(String[] args) throws IOException, SAXException, TikaException {
		
		 BodyContentHandler handler   = new BodyContentHandler();
		 OOXMLParser parser           = new OOXMLParser();
		 Metadata metadata            = new Metadata();
		 ParseContext pcontext        = new ParseContext();
		 try (InputStream stream = AutoDetectParseExample.class.getResourceAsStream("srcmini.xls")) {
		        parser.parse(stream, handler, metadata, pcontext);
	     System.out.println("Document Content:" + handler.toString());
	     System.out.println("Document Metadata:");
	     String[] metadatas = metadata.names(); 
	     for(String data : metadatas) {
	         System.out.println(data + ":   " + metadata.get(data));  
	     }
		 }catch(Exception e) {System.out.println("Exception message: "+ e.getMessage());}		
	}
}

我们的文件包含以下内容。

Tika MS Office文件提取

输出

Document Content:Sheet1
	Employee Manual Punch
	In Time	Out Time	Device	Total Minute	Total Time	Working Minutes
	01-Nov-17 8:27:00 AM	01-Nov-17 6:30:00 PM	1	603	540	-63
	02-Nov-17 8:09:00 AM	02-Nov-17 6:30:00 PM	1	621	540	-81
	03-Nov-17 8:25:00 AM	03-Nov-17 6:30:00 PM	1	605	540	-65

Document Metadata:
date:   2018-05-06T11:20:06Z
cp:revision:   1
custom:DocSecurity:   0
dc:creator:   Reception
dcterms:created:   2017-12-03T08:38:57Z
language:   en-IN
Last-Modified:   2018-05-06T11:20:06Z
dcterms:modified:   2018-05-06T11:20:06Z
Last-Save-Date:   2018-05-06T11:20:06Z
Template:   
protected:   false
meta:save-date:   2018-05-06T11:20:06Z
Application-Name:   LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
modified:   2018-05-06T11:20:06Z
custom:LinksUpToDate:   false
Content-Type:   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
creator:   Reception
dc:language:   en-IN
meta:author:   Reception
meta:creation-date:   2017-12-03T08:38:57Z
extended-properties:Application:   LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
custom:ShareDoc:   false
custom:ScaleCrop:   false
Creation-Date:   2017-12-03T08:38:57Z
custom:HyperlinksChanged:   false
Revision-Number:   1
extended-properties:Template:   
custom:AppVersion:   12.0000
赞(0) 打赏
未经允许不得转载:srcmini » Tika MS Office文件提取示例
分享到: 更多 (0)

评论 抢沙发

评论前必须登录!

 

觉得文章有用就打赏一下文章作者

微信扫一扫打赏