How can I read a .doc file in Java without getting garbled text?
If you don't need to extract the images, you can use the following method:
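import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public static void testWord1() {
    // Word 2003 (.doc): images are not read.
    // Note: "c:\a.doc" does not compile in Java ("\a" is an illegal escape), so use "/" or "\\".
    try (InputStream is = new FileInputStream(new File("c:/a.doc"))) {
        WordExtractor ex = new WordExtractor(is);
        String text2003 = ex.getText().trim();
        System.out.println(text2003);
        // Word 2007 (.docx): images are not read, and table data is placed at the end of the string.
        // OPCPackage opcPackage = POIXMLDocument.openPackage("c:/a.docx");
        // POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
        // String text2007 = extractor.getText();
        // System.out.println(text2007);
    } catch (Exception e) {
        e.printStackTrace();
    }
}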
For Word 2003 files, use the first half of the method.
For Word 2007 files, use the second half (the commented-out part).
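Choosing between the two halves by hand is error-prone when a file has the wrong extension. A minimal sketch of one way to decide automatically, assuming it is enough to look at the file's leading bytes: binary .doc files are OLE2 compound documents, while .docx files are ZIP packages. The class `WordFormatSniffer` and its method names are hypothetical and not part of POI. The check only identifies the container, so an .xls or a plain .zip would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of POI): guesses the Word container format
// from the first bytes of the file.
public class WordFormatSniffer {

    public enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (used by binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header, "PK\3\4" (used by .docx packages).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    public static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.DOC_OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.DOCX_ZIP;
        }
        return Format.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    public static Format detect(Path file) throws IOException {
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < header.length
                    && (r = in.read(header, n, header.length - n)) > 0) {
                n += r;
            }
        }
        return detect(Arrays.copyOf(header, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // The default path is only an example.
        Path file = Paths.get(args.length > 0 ? args[0] : "c:/a.doc");
        System.out.println(file + " -> " + detect(file));
    }
}
```

With a result of `DOC_OLE2` you would call the `WordExtractor` branch, and with `DOCX_ZIP` the `XWPFWordExtractor` branch.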
Setting the encoding in POI
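If the document contains only text (no images, tables, etc.),
you can use the following method.
First download jacob:
http://sourceforge.net/project/showfiles.php?group_id=109543&package_id=118368
You need to put jacob-1.15-M4-x86.dll in both system32 and the JDK's bin directory.
The idea is to convert the Word document to txt first, then read the text from the txt file.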
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class WordReader1 {
    public static void extractDoc(String inputFile, String outputFile) {
        boolean flag = false;
        // Start the Word application
        ActiveXComponent app = new ActiveXComponent("Word.Application");
        try {
            // Keep Word invisible
            app.setProperty("Visible", new Variant(false));
            // Open the Word file
            Dispatch doc1 = app.getProperty("Documents").toDispatch();
            Dispatch doc2 = Dispatch.invoke(doc1, "Open", Dispatch.Method,
                    new Object[] { inputFile, new Variant(false), new Variant(true) },
                    new int[1]).toDispatch();
            // Save it to a temporary file in txt format
            Dispatch.invoke(doc2, "SaveAs", Dispatch.Method,
                    new Object[] { outputFile, new Variant(7) }, new int[1]);
            // Close the document without saving changes
            Variant f = new Variant(false);
            Dispatch.call(doc2, "Close", f);
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always quit Word, even if the conversion failed
            app.invoke("Quit", new Variant[] {});
        }
        if (flag) {
            System.out.println("Transformed Successfully");
        } else {
            System.out.println("Transform Failed");
        }
    }

    public static void main(String[] args) {
        WordReader1.extractDoc("c:/a.doc", "c:/a.txt");
    }
}
http://download.csdn.net/detail/hcs371239924/3761147
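The jacob approach only produces the txt file. To avoid garbled text, that file still has to be decoded with the right charset. The value 7 passed to `SaveAs` corresponds to `wdFormatUnicodeText` in Word's `WdSaveFormat` enumeration, so the output is normally UTF-16 with a byte order mark, but treat that as an assumption and check the actual file. Below is a minimal sketch that picks the charset from the BOM and falls back to a caller-supplied one. The class `ConvertedTextReader` is hypothetical and not part of jacob or POI.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: decodes the text file written by the Word conversion.
public class ConvertedTextReader {

    // Uses the byte order mark to choose the charset when one is present,
    // otherwise decodes with the fallback charset. The BOM itself is skipped.
    public static String decode(byte[] bytes, Charset fallback) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFF
                && (bytes[1] & 0xFF) == 0xFE) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
        }
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, fallback);
    }

    // Reads the whole file into memory and decodes it.
    public static String read(Path file, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(file), fallback);
    }

    public static void main(String[] args) throws IOException {
        // The default path matches the output file used in WordReader1.main.
        Path file = Paths.get(args.length > 0 ? args[0] : "c:/a.txt");
        System.out.println(read(file, Charset.defaultCharset()));
    }
}
```

If the file has no BOM (for example when it was saved with a different format constant), pass the charset you expect, such as `Charset.forName("GBK")` on a Chinese Windows system, as the fallback.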
Using POI:
package org.apache.poi.hwpf;

import org.apache.poi.hwpf.model.FileInformationBlock;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.POIDataSamples;


public final class HWPFDocFixture
{
  public static final String DEFAULT_TEST_FILE = "test.doc";

  public byte[] _tableStream;
  public byte[] _mainStream;
  public FileInformationBlock _fib;
  private String _testFile;

  public HWPFDocFixture(Object obj, String testFile)
  {
    _testFile = testFile;
  }

  public void setUp()
  {
    try
    {
      POIFSFileSystem filesystem = new POIFSFileSystem(
          POIDataSamples.getDocumentInstance().openResourceAsStream(_testFile));

      DocumentEntry documentProps =
          (DocumentEntry) filesystem.getRoot().getEntry("WordDocument");
      _mainStream = new byte[documentProps.getSize()];
      filesystem.createDocumentInputStream("WordDocument").read(_mainStream);

      // use the fib to determine the name of the table stream.
      _fib = new FileInformationBlock(_mainStream);

      String name = "0Table";
      if (_fib.getFibBase().isFWhichTblStm())
      {
        name = "1Table";
      }

      // read in the table stream.
      DocumentEntry tableProps =
          (DocumentEntry) filesystem.getRoot().getEntry(name);
      _tableStream = new byte[tableProps.getSize()];
      filesystem.createDocumentInputStream(name).read(_tableStream);

      _fib.fillVariableFields(_mainStream, _tableStream);
    }
    catch (Throwable t)
    {
      t.printStackTrace();
    }
  }

  public void tearDown()
  {
  }

}
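This kind of problem can obviously be solved just by checking the API!
This isn't much use: it can't read tables or images, and it can't even recover the most basic formatting.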