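My project needed jTidy to convert HTML into XML for later processing, but Chinese text in the HTML document came out garbled after the jTidy conversion.
After some digging, the cause turned out to be jTidy's inCharEncoding and outCharEncoding settings: both need to be set to UTF-8 for Chinese characters to be read and written correctly.
Only two lines of code really matter: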
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
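To see why an encoding mismatch garbles Chinese text, here is a minimal sketch that uses only the JDK and no jTidy at all. It decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. ISO-8859-1 stands in for "the wrong charset" as an assumption for illustration; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes `text` as UTF-8 bytes, then decodes those bytes with `charset`.
    public static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character string for "Chinese", written
        // with escapes so the source file compiles under any file encoding.
        String original = "\u4e2d\u6587";

        // Decoding with a mismatched charset (assumed here: ISO-8859-1)
        // turns each of the 6 UTF-8 bytes into its own character.
        String wrong = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);
        // Decoding with the matching charset restores the original text.
        String right = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);

        System.out.println("mismatched charset preserved text: " + original.equals(wrong));
        System.out.println("matching charset preserved text: " + original.equals(right));
    }
}
```

The same reasoning applies to both directions of the conversion, which is why the input and the output encoding are set together.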
The complete code is as follows:
import org.w3c.tidy.Tidy;

import java.io.*;

public class Main {
    public static final String WORK_DIR = "D:\\data\\temp\\jTidy\\";
    public static final String INPUT_FILE = "input.html";
    public static final String OUTPUT_FILE = "output.xml";
    public static final String ERROR_LOG = "error.log";

    public static void convert(InputStream inputStream, OutputStream outputStream) {
        Tidy tidy = new Tidy();
        // Read and write UTF-8 so that Chinese characters survive the conversion.
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        // A wrap length of 0 turns off line wrapping in the output.
        tidy.setWraplen(0);
        // Send jTidy's diagnostics to a log file. Closing the writer flushes it,
        // so the log is not left empty or truncated.
        try (PrintWriter errout = new PrintWriter(new FileWriter(WORK_DIR + ERROR_LOG))) {
            tidy.setErrout(errout);
            tidy.parse(inputStream, outputStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // try-with-resources closes both streams even if the conversion fails.
        try (InputStream inputStream = new FileInputStream(WORK_DIR + INPUT_FILE);
             OutputStream outputStream = new FileOutputStream(WORK_DIR + OUTPUT_FILE)) {
            convert(inputStream, outputStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
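Since the converted file is meant for later processing, one way to check that the Chinese text survived is to parse the result with the JDK's XML parser and look at the text it returns. The sketch below is only an illustration under assumptions: the sample string is a hand-written stand-in, not real jTidy output, it assumes the output is well-formed XML, and the class and method names are hypothetical.

```java
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OutputCheck {
    // Parses the stream as XML and returns the text content of the root element.
    public static String rootText(InputStream xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xml);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a converted document (assumption: not real jTidy output).
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>\u4e2d\u6587</p>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));

        // The parsed text should match the original Chinese characters.
        System.out.println("text preserved: " + "\u4e2d\u6587".equals(rootText(in)));
    }
}
```

To run the same check on a real file, a FileInputStream opened on the output file could be passed to rootText instead of the in-memory sample.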
Original post: https://www.cnblogs.com/haoyoung/p/10251080.html