Truncating UTF-8 strings in C++

Date: 2015-11-10 23:51:16

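Truncating a UTF-8 string by raw byte count (a plain substr) can split a multi-byte character and produce invalid output. While searching for a fix I found a Python version: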

http://blog.csdn.net/dr_freedom/article/details/5457645
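To make the problem concrete, here is a minimal sketch of a check for whether a byte-based cut landed in the middle of a character. The helper name `endsMidCharacter` is hypothetical and not part of the original post, and it assumes the input was well-formed UTF-8 of at most 4 bytes per character before the cut.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` ends in the middle of a
// multi-byte UTF-8 sequence. Assumes the bytes before the cut were well-formed.
bool endsMidCharacter(const std::string &s) {
    std::size_t i = s.size();
    std::size_t cont = 0;  // trailing continuation bytes (10xxxxxx) seen
    while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) return cont > 0;  // nothing but continuation bytes: no lead byte

    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need = 0;  // continuation bytes announced by the lead byte
    if (lead >= 0xF0) need = 3;       // 11110xxx
    else if (lead >= 0xE0) need = 2;  // 1110xxxx
    else if (lead >= 0xC0) need = 1;  // 110xxxxx
    return cont < need;               // fewer than announced: the cut split a character
}
```

For the string "中文abc" (two 3-byte characters followed by ASCII), `substr(0, 4)` keeps the first character plus one stray lead byte, so the check reports a split; `substr(0, 3)` and `substr(0, 6)` end on character boundaries.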

Following the same principle as that Python version, I wrote the C++ version below and am recording it here.

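#include <string>

// Returns the first utf8Len characters (not bytes) of the UTF-8 string src.
// Assumes src is well-formed UTF-8.
std::string utf8Cut(const std::string &src, int utf8Len) {
    std::string ret;
    int utf8LenCnt = 0;                                   // characters copied so far
    std::string::size_type srcIdx = 0;                    // current byte offset in src
    const std::string::size_type srcLen = src.length();
    while (utf8LenCnt < utf8Len && srcIdx < srcLen) {
        // The lead byte determines how many bytes the character occupies.
        const unsigned char tmp = static_cast<unsigned char>(src[srcIdx]);
        std::string::size_type cutLen = 1;                // 0xxxxxxx (ASCII) or any other byte
        if (tmp >= 252)
            cutLen = 6;                                   // 1111110x
        else if (tmp >= 248)
            cutLen = 5;                                   // 111110xx
        else if (tmp >= 240)
            cutLen = 4;                                   // 11110xxx
        else if (tmp >= 224)
            cutLen = 3;                                   // 1110xxxx
        else if (tmp >= 192)
            cutLen = 2;                                   // 110xxxxx
        // substr clamps to the end of src if the last sequence is incomplete.
        ret += src.substr(srcIdx, cutLen);
        srcIdx += cutLen;
        ++utf8LenCnt;
    }
    return ret;
}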
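The function above trusts its input. As a possible variation, the sketch below stops before a trailing sequence that is cut off instead of copying part of it, and takes a single substr at the end rather than appending piece by piece. The name `utf8CutSafe` is hypothetical, and the sketch assumes the modern 4-byte limit on UTF-8 sequences; any byte that is not a valid 2-, 3-, or 4-byte lead is counted as one unit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant of utf8Cut: keeps at most maxChars characters and
// never emits an incomplete trailing sequence.
std::string utf8CutSafe(const std::string &src, std::size_t maxChars) {
    std::size_t idx = 0;    // byte offset of the next character
    std::size_t count = 0;  // characters accepted so far
    while (count < maxChars && idx < src.size()) {
        const unsigned char lead = static_cast<unsigned char>(src[idx]);
        std::size_t len = 1;                     // ASCII, or an unexpected byte taken as one unit
        if ((lead & 0xE0) == 0xC0) len = 2;      // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4; // 11110xxx
        if (idx + len > src.size()) break;       // sequence is cut off: stop before it
        idx += len;
        ++count;
    }
    return src.substr(0, idx);
}
```

With the input "中文abc", asking for 3 characters returns the 7 bytes of "中文a". An input that ends with only two bytes of a 3-byte character loses that fragment rather than passing it through.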

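The approach follows from the UTF-8 encoding table below: the high bits of the first byte give the length of the sequence. The 5- and 6-byte forms come from the original UTF-8 definition; RFC 3629 later restricted UTF-8 to at most 4 bytes.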

U-00000000 - U-0000007F	0xxxxxxx
U-00000080 - U-000007FF	110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF	1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
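One consequence of the table is that every byte after the first in a sequence has the form 10xxxxxx, and no first byte does. A minimal sketch of counting characters on that basis follows; the name `utf8CharCount` is hypothetical, and the function assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts characters in a well-formed UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8CharCount(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // lead byte or ASCII starts a new character
    }
    return n;
}
```

This can be used to pick a sensible character limit before truncating, for example to check whether a string is already short enough.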


Source: http://www.cnblogs.com/envy-liu/p/4954881.html
