
How to get the charset from a string and a file

Time: 2014-04-12 06:13:42

a. Get charset from a string

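public String getCharsetFromString(String srcString) throws IOException {
   // getBytes() encodes with the platform default charset, so the result
   // depends on the JVM settings as well as on the string itself
   BufferedInputStream bin = new BufferedInputStream(
           new ByteArrayInputStream(srcString.getBytes()));
   // combine the first two bytes into one 16-bit value
   int p = (bin.read() << 8) + bin.read();
   String code;
   // 0xefbb, 0xfffe and 0xfeff are the first two bytes of a byte order mark (BOM);
   // 0x5c75 is the ASCII text "\u"
   switch (p) {
   case 0xefbb:
       code = "UTF-8";
       break;
   case 0xfffe:
       // FF FE is the little-endian BOM (often labelled "Unicode" on Windows)
       code = "UTF-16LE";
       break;
   case 0xfeff:
       code = "UTF-16BE";
       break;
   case 0x5c75:
       // a descriptive label, not a charset name that Java can look up
       code = "ANSI|ASCII";
       break;
   default:
       code = "ISO-8859-1";
   }
   return code;
}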
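
Because the method above goes through getBytes(), it only sees a BOM if the platform default charset happens to reproduce it. A minimal sketch of the same BOM check applied directly to raw bytes could look like the following. The class and method names (BomSniffer, charsetFromBom) are made up for illustration, and the sketch deliberately ignores UTF-32 BOMs.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Returns the charset name implied by a leading BOM, or null if no known BOM is present. */
    static String charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the encoding cannot be read off the first bytes
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(charsetFromBom(utf8WithBom));
        System.out.println(charsetFromBom("hi".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A null result only means that no BOM was found. Many UTF-8 files are written without one, so a fallback is still needed.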


b. Get charset from a file (not reliable)

public String getCharsetFromFile(String filePath) throws IOException {
   FileInputStream fis = null;
   InputStreamReader isr = null;
   String s = null;
   try {
       // create a new input stream reader (it uses the platform default charset)
       fis = new FileInputStream(filePath);
       isr = new InputStreamReader(fis);
       // the name of the character encoding the reader was created with
       s = isr.getEncoding();
   } catch (Exception e) {
       // print error
       System.out.print("Could not open " + filePath + ": " + e.getMessage());
   } finally {
       // close the streams and release associated resources
       if (isr != null)
           isr.close();
       if (fis != null)
           fis.close();
   }
   return s;
}

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.

The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
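
To illustrate the point, here is a small sketch (the class name GetEncodingDemo and the helper reportedEncoding are invented for this example). The same bytes are opened with two different charsets, and getEncoding() reports whatever the reader was told to use. Note that the returned string may be a historical name rather than the canonical one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetEncodingDemo {
    /** Returns what getEncoding() reports for a reader opened over data with the given charset. */
    static String reportedEncoding(byte[] data, Charset cs) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            // must be called before close(); no bytes are inspected here
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // bytes that were really produced as ISO-8859-1
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // the reader simply echoes the charset it was constructed with
        System.out.println(reportedEncoding(data, StandardCharsets.UTF_8));
        System.out.println(reportedEncoding(data, StandardCharsets.ISO_8859_1));
    }
}
```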

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Still, you can try to guess the encoding yourself if you have to. Every language has a typical frequency for each character: in English the character e appears very often, while ê appears very rarely. An ISO-8859-1 stream usually contains no 0x00 bytes, whereas a UTF-16 stream of mostly Latin text contains a lot of them.
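
The 0x00 observation can be turned into a very crude check. The sketch below is only an illustration: the class name ZeroByteHeuristic and the 20% threshold are assumptions chosen for the example, not values from any library, and the check says nothing about text that is mostly non-Latin.

```java
import java.nio.charset.StandardCharsets;

class ZeroByteHeuristic {
    /** Fraction of bytes equal to 0x00; returns 0.0 for empty input. */
    static double zeroRatio(byte[] data) {
        if (data.length == 0) return 0.0;
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) zeros++;
        }
        return (double) zeros / data.length;
    }

    /** Crude guess: many zero bytes hint at UTF-16. The 0.2 threshold is an arbitrary assumption. */
    static boolean looksLikeUtf16(byte[] data) {
        return zeroRatio(data) > 0.2;
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```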

Or you could ask the user. I've seen applications that show a snippet of the file decoded in several encodings and ask you to select the "correct" one.
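
A rough sketch of that "let the user pick" approach: decode the first bytes of the data with each candidate charset and show the results side by side. The class name EncodingPreview, the method previews and the candidate list are hypothetical. Cutting the data at maxBytes can split a multi-byte sequence, in which case the decoder substitutes a replacement character at the end of the snippet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingPreview {
    /** Decodes the first maxBytes of data with every named charset, keeping the given order. */
    static Map<String, String> previews(byte[] data, int maxBytes, String... charsetNames) {
        int len = Math.min(data.length, maxBytes);
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : charsetNames) {
            // malformed input is replaced by the charset's default replacement, not rejected
            result.put(name, new String(data, 0, len, Charset.forName(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // print one candidate per line so a user could choose the readable one
        previews(data, 64, "UTF-8", "ISO-8859-1", "UTF-16LE")
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```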


c. Get charset from a file (more reliable)

private String getCharsetByInputStream(InputStream ins){
   // fall back to utf-8 whenever nothing can be detected
   String charset = "utf-8";
   if(null != ins){
       UniversalDetector detector = new UniversalDetector(null);
       try {
           // available() may return 0 while data is still pending, so use a fixed-size buffer
           byte[] buf = new byte[4096];
           int nread;

           // feed the detector until it has made up its mind or the stream ends
           while ((nread = ins.read(buf)) > 0 && !detector.isDone()) {
               detector.handleData(buf, 0, nread);
           }
       } catch (IOException e) {
           LOG.error("--getCharsetByInputStream:error happened while getting charset from inputstream. ",e);
           return "utf-8";
       }
       // tell the detector that no more data is coming, then read its verdict
       detector.dataEnd();
       charset = detector.getDetectedCharset();
       if (charset == null || "".equals(charset)) {
           charset = "utf-8";
       }

       detector.reset();
   }

   return charset;
}

UniversalDetector comes from juniversalchardet: http://code.google.com/p/juniversalchardet/


Then read the InputStream as a String with the detected charset:

private String parseTruncaredSizeBinaryResourceToString(Integer resourceKey, Integer limitedSize){
   String truncaredResourceText = null;
   InputStream ins = proactiveAnalysisService.retrieveGlobalResourceBinary(resourceKey, null, limitedSize);
   if (null != ins) {
       String charset = getCharsetByInputStream(ins);
       InputStreamReader reader = null;

       try {
           // rewind to the beginning: detecting the charset consumed the stream.
           // reset() only works if the stream supports mark/reset (e.g. ByteArrayInputStream)
           ins.reset();
       } catch (IOException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:error happened while resetting resource inputstream. ",e);
       }

       try {
           reader = new InputStreamReader(ins, charset);
       } catch (UnsupportedEncodingException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:unsupported charset for resource inputstream. ",e);
           return truncaredResourceText;
       }

       // collect decoded chars; writing them to a byte stream would cut each char down to one byte
       StringBuilder out = new StringBuilder();
       try {
           int i;
           while ((i = reader.read()) != -1) {
               out.append((char) i);
           }

           truncaredResourceText = out.toString();
       } catch (IOException e) {
           LOG.error("-- parseTruncaredSizeBinaryResourceToString:Error happened when reading inputstream. ", e);
       } finally {
           try{
               // closing the reader also closes the underlying input stream
               reader.close();
           }catch(IOException e){
               LOG.error("-- parseTruncaredSizeBinaryResourceToString:Error happened while closing inputstream. ",e);
           }
       }
   }
   return truncaredResourceText;
}


link to http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

and how to change the index position of an InputStream: http://stackoverflow.com/questions/3474911/changing-the-index-positioning-in-inputstream
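
For reference, rewinding a stream after inspecting its first bytes can be sketched with mark() and reset(), assuming the stream supports them (BufferedInputStream and ByteArrayInputStream do; a plain FileInputStream generally does not). The class name RewindDemo and the helper peek are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RewindDemo {
    /** Reads up to n bytes for inspection, then rewinds so the next read starts at the same position. */
    static byte[] peek(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(n); // remember the current position for at least n bytes
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) break; // end of stream reached early
                total += r;
            }
            in.reset(); // jump back to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // wrapping in BufferedInputStream adds mark/reset support to any stream
        InputStream in = new BufferedInputStream(new ByteArrayInputStream("hello".getBytes()));
        System.out.println(peek(in, 2).length);
        System.out.println((char) in.read());
    }
}
```

The read limit passed to mark() has to cover everything that is read before reset() is called, so a charset detector that may consume the whole stream needs a correspondingly large limit.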


This article comes from the "六度空间" blog. Please keep this source link: http://jasonwalker.blog.51cto.com/7020143/1394395
