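How to read a .doc file in Java

How can I read a .doc file with Java and make sure the text does not come out garbled?

If you do not need to extract the images, you can use the following method: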
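import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public static void testWord1() {
    try {
        // Word 2003 (.doc): images are not read
        // Use "/" (or "\\") in the path; "c:\a.doc" is not a valid Java string literal
        InputStream is = new FileInputStream(new File("c:/a.doc"));
        WordExtractor ex = new WordExtractor(is);
        String text2003 = ex.getText().trim();
        System.out.println(text2003);

        // Word 2007 (.docx): images are not read, and table data is placed at the end of the string
        // OPCPackage opcPackage = POIXMLDocument.openPackage("c:/a.docx");
        // POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
        // String text2007 = extractor.getText();
        // System.out.println(text2007);
    } catch (Exception e) {
        e.printStackTrace();
    }
}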

For a Word 2003 file, use the first half.
For a Word 2007 file, use the second (commented-out) half.
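Which half applies can be decided from the file itself rather than from its extension. The sketch below is only an illustration using the standard library: the class name `WordFormatSniffer` and its methods are hypothetical and are not part of POI. It relies on two well-known signatures: Word 2003 files are OLE2 compound documents starting with the bytes D0 CF 11 E0 A1 B1 1A E1, and Word 2007 files are ZIP archives starting with "PK" 03 04. Note that other OLE2 files (for example .xls) and other ZIP archives share these signatures, so this only narrows the choice down.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the leading bytes.
public class WordFormatSniffer {

    public enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound document signature (Word 97-2003 .doc)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (Word 2007+ .docx)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    public static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.DOC_OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.DOCX_ZIP;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Reads up to 8 bytes from the file and classifies them.
    public static Format detectFile(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] header = new byte[8];
            int total = 0;
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) break; // file shorter than 8 bytes
                total += n;
            }
            return detect(Arrays.copyOf(header, total));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(args[0] + " -> " + detectFile(args[0]));
    }
}
```

With a result of `DOC_OLE2` the `WordExtractor` half would be the one to try, and with `DOCX_ZIP` the `XWPFWordExtractor` half.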


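If the document contains only text, with no images, tables, and so on, you can use the following approach.
First download jacob:
http://sourceforge.net/project/showfiles.php?group_id=109543&package_id=118368
jacob-1.15-M4-x86.dll must be placed in system32 and in the JDK's bin directory.
The idea is to convert the Word document to txt first, then read the text from the txt file.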
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class WordReader1 {

    public static void extractDoc(String inputFile, String outputFile) {
        boolean flag = false;
        // Start the Word application
        ActiveXComponent app = new ActiveXComponent("Word.Application");
        try {
            // Keep Word invisible
            app.setProperty("Visible", new Variant(false));
            // Open the Word file (second argument: no conversion prompt; third: read-only)
            Dispatch doc1 = app.getProperty("Documents").toDispatch();
            Dispatch doc2 = Dispatch.invoke(doc1, "Open", Dispatch.Method,
                    new Object[] { inputFile, new Variant(false), new Variant(true) },
                    new int[1]).toDispatch();
            // Save as txt to a temporary file (7 is wdFormatUnicodeText)
            Dispatch.invoke(doc2, "SaveAs", Dispatch.Method,
                    new Object[] { outputFile, new Variant(7) }, new int[1]);
            // Close the document without saving changes
            Variant f = new Variant(false);
            Dispatch.call(doc2, "Close", f);
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Quit Word even if the conversion failed
            app.invoke("Quit", new Variant[] {});
        }
        if (flag) {
            System.out.println("Transformed Successfully");
        } else {
            System.out.println("Transform Failed");
        }
    }

    public static void main(String[] args) {
        WordReader1.extractDoc("c:/a.doc", "c:/a.txt");
    }
}
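The second step, reading the text back from the txt file, is where garbling usually comes from: the bytes have to be decoded with the charset they were written in. The value 7 passed to SaveAs is wdFormatUnicodeText, and the sketch below assumes that Word writes such a file as UTF-16 with a byte order mark; if your file turns out to be in another encoding, pass that charset instead. The class `TextFileReader` and its methods are hypothetical names used only for illustration, built on the standard library.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reads a text file with an explicit charset
// instead of relying on the platform default.
public class TextFileReader {

    // Decodes raw bytes with the given charset.
    // StandardCharsets.UTF_16 honours a leading byte order mark and strips it.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Reads the whole file into memory and decodes it.
    public static String readAll(String path, Charset charset) throws IOException {
        return decode(Files.readAllBytes(Paths.get(path)), charset);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java TextFileReader <file.txt>");
            return;
        }
        // Assumption: the file was produced by SaveAs with format 7 (Unicode text).
        System.out.println(readAll(args[0], StandardCharsets.UTF_16));
    }
}
```

For the example above this would be called as `TextFileReader.readAll("c:/a.txt", StandardCharsets.UTF_16)` after `extractDoc` has finished.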

http://download.csdn.net/detail/hcs371239924/3761147

Using POI:
package org.apache.poi.hwpf;

import org.apache.poi.hwpf.model.FileInformationBlock;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.POIDataSamples;

public final class HWPFDocFixture
{
    public static final String DEFAULT_TEST_FILE = "test.doc";

    public byte[] _tableStream;
    public byte[] _mainStream;
    public FileInformationBlock _fib;
    private String _testFile;

    public HWPFDocFixture(Object obj, String testFile)
    {
        _testFile = testFile;
    }

    public void setUp()
    {
        try
        {
            POIFSFileSystem filesystem = new POIFSFileSystem(
                    POIDataSamples.getDocumentInstance().openResourceAsStream(_testFile));

            DocumentEntry documentProps =
                    (DocumentEntry) filesystem.getRoot().getEntry("WordDocument");
            _mainStream = new byte[documentProps.getSize()];
            filesystem.createDocumentInputStream("WordDocument").read(_mainStream);

            // use the fib to determine the name of the table stream.
            _fib = new FileInformationBlock(_mainStream);

            String name = "0Table";
            if (_fib.getFibBase().isFWhichTblStm())
            {
                name = "1Table";
            }

            // read in the table stream.
            DocumentEntry tableProps =
                    (DocumentEntry) filesystem.getRoot().getEntry(name);
            _tableStream = new byte[tableProps.getSize()];
            filesystem.createDocumentInputStream(name).read(_tableStream);

            _fib.fillVariableFields(_mainStream, _tableStream);
        }
        catch (Throwable t)
        {
            t.printStackTrace();
        }
    }

    public void tearDown()
    {
    }

}

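This kind of question can obviously be answered just by checking the API!

Not much use: it can't read tables or images. It can't even read the most basic formatting.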