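Truncating a UTF-8 string by brute force with substr, which counts bytes rather than characters, can cut a multi-byte character in half and produce broken output. I searched around and found a Python version: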
http://blog.csdn.net/dr_freedom/article/details/5457645
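To illustrate the problem, here is a minimal sketch of a check for whether a byte offset falls inside a multi-byte character. The helper names `isContinuationByte` and `cutSplitsCharacter` are made up for this illustration and do not come from the linked post. The check assumes the input is well-formed UTF-8 and relies on the fact that every non-leading byte of a character has the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which never start a character.
bool isContinuationByte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: reports whether cutting `s` at byte offset `pos`
// (as s.substr(0, pos) would) lands in the middle of a character.
bool cutSplitsCharacter(const std::string &s, std::size_t pos) {
    return pos < s.size() &&
           isContinuationByte(static_cast<unsigned char>(s[pos]));
}
```

For the six-byte string "你好" (E4 BD A0 E5 A5 BD), a cut at offset 3 is safe, while a cut at offset 4 leaves a dangling E5 lead byte at the end of the result.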
Following the same principle, I implemented the C++ version below and am recording it here.
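#include <string>

// Returns the first utf8Len characters (not bytes) of the UTF-8 string src.
std::string utf8Cut(const std::string &src, int utf8Len) {
    std::string ret;
    int utf8LenCnt = 0;                                  // characters copied so far
    std::string::size_type srcIdx = 0;                   // byte offset into src
    const std::string::size_type srcLen = src.length();
    while (utf8LenCnt < utf8Len && srcIdx < srcLen) {
        // The lead byte determines how many bytes the character occupies.
        const unsigned char tmp = static_cast<unsigned char>(src[srcIdx]);
        std::string::size_type cutLen;
        if (tmp >= 252)      cutLen = 6;  // 1111110x
        else if (tmp >= 248) cutLen = 5;  // 111110xx
        else if (tmp >= 240) cutLen = 4;  // 11110xxx
        else if (tmp >= 224) cutLen = 3;  // 1110xxxx
        else if (tmp >= 192) cutLen = 2;  // 110xxxxx
        else                 cutLen = 1;  // ASCII, or a stray continuation byte
        // substr clamps at the end of src, so a truncated final sequence
        // cannot read out of range.
        ret += src.substr(srcIdx, cutLen);
        srcIdx += cutLen;
        ++utf8LenCnt;
    }
    return ret;
}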
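The principle is shown in the table below: the high bits of the first byte tell how many bytes the character occupies.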
U-00000000 - U-0000007F | 0xxxxxxx |
U-00000080 - U-000007FF | 110xxxxx 10xxxxxx |
U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
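The table can also be read as a set of bit masks rather than numeric thresholds. The sketch below is one possible way to write that. The names `utf8SeqLen` and `utf8CharCount` are hypothetical, and the code assumes well-formed input. It keeps the 5- and 6-byte forms from the table, although the current UTF-8 definition (RFC 3629) stops at 4 bytes (U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Sequence length implied by a lead byte, following the table above.
// Returns 0 for a continuation byte (10xxxxxx) and for 0xFE / 0xFF,
// which are not valid lead bytes.
int utf8SeqLen(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((lead & 0xFC) == 0xF8) return 5;  // 111110xx
    if ((lead & 0xFE) == 0xFC) return 6;  // 1111110x
    return 0;
}

// Counts characters by counting every byte that is not a continuation
// byte, since each character has exactly one lead byte.
std::size_t utf8CharCount(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Unlike the threshold chain, the mask version distinguishes a continuation byte from an ASCII byte, so a caller could choose to skip or reject stray bytes instead of treating them as one-byte characters.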
Original post: http://www.cnblogs.com/envy-liu/p/4954881.html