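When truncating a string that contains multi-byte characters, check how many bytes the character at the cut point occupies. A multi-byte character must not be split in the middle, otherwise the truncated string ends in garbled text.
Implementations for UTF-8 and GB18030 are given below. Either one works: if the data is in a different encoding, convert it first with decode and encode.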
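As a small illustration of the two points above (a naive byte slice breaks a character, and data can be converted between the two encodings first), here is a minimal Python 3 sketch. The helper names `transcode` and `is_valid` are made up for this example and do not come from the original post.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from codec `src` to codec `dst` via a decoded str."""
    return data.decode(src).encode(dst)


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "中文abc".encode("utf-8")
gb_bytes = transcode(utf8_bytes, "utf-8", "gb18030")

# Each of the two Chinese characters takes 3 bytes in UTF-8 and 2 in GB18030.
print(len(utf8_bytes), len(gb_bytes))     # 9 7

# Cutting at byte 4 splits the second character, so the result is invalid.
print(is_valid(utf8_bytes[:4], "utf-8"))  # False
```

The same byte limit therefore keeps a different number of characters depending on the encoding, which is why the cut position has to be computed on the encoded bytes.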
Method 1: for UTF-8. Reference: http://blog.csdn.net/marising/article/details/3452971
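    def subString(string, length):
        """Cut UTF-8 encoded bytes to at most `length` bytes without splitting a character.

        Written for Python 3: `string` is a bytes object, so indexing yields an int.
        """
        if length <= 0:
            return string[:0]
        if length >= len(string):
            return string
        i = 0  # end of the last complete character that fits
        p = 0  # end of the character currently being examined
        while True:
            ch = string[i]  # lead byte of the current character
            if ch >= 252:    # 1111110x (legacy 6-byte form)
                p += 6
            elif ch >= 248:  # 111110xx (legacy 5-byte form)
                p += 5
            elif ch >= 240:  # 11110xxx
                p += 4
            elif ch >= 224:  # 1110xxxx
                p += 3
            elif ch >= 192:  # 110xxxxx
                p += 2
            else:            # ASCII (or a stray continuation byte)
                p += 1
            # Stop only when the character would pass the limit; a character
            # that ends exactly at `length` is kept.
            if p > length:
                break
            i = p
        return string[0:i]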
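A different way to get the same result for UTF-8, shown here only as a sketch, is to start at the cut point and step backwards over continuation bytes (those of the form 10xxxxxx) instead of scanning forward from the start. The function name `truncate_utf8` is made up for this example, and the sketch assumes the input is valid UTF-8 held in a Python 3 bytes object.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 bytes to at most `limit` bytes on a character boundary."""
    if limit <= 0:
        return b""
    if limit >= len(data):
        return data
    end = limit
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character,
    # so move back until `end` sits on a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


data = "中文abc".encode("utf-8")
print(truncate_utf8(data, 4).decode("utf-8"))  # 中
print(truncate_utf8(data, 7).decode("utf-8"))  # 中文a
```

Because a UTF-8 character has at most three continuation bytes under the current standard, the backward walk takes only a few steps regardless of the input length.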
Method 2: for GB18030-encoded strings
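    def cut_string_off(string, s_len):
        """Cut GB18030 encoded bytes to at most `s_len` bytes without splitting a character.

        Written for Python 3: `string` is a bytes object, so indexing yields an int.
        """
        if s_len <= 0 or s_len >= len(string):
            return string  # nothing to cut (a non-positive limit returns the input unchanged)
        len_num = 0
        while len_num < s_len:
            tmp_c = string[len_num]
            tmp_nextc = string[len_num + 1]  # safe: len_num + 1 <= s_len < len(string)
            if 0x81 <= tmp_c <= 0xFE and 0x40 <= tmp_nextc <= 0xFE:
                step = 2  # two-byte character
            elif 0x81 <= tmp_c <= 0xFE and 0x30 <= tmp_nextc <= 0x39:
                step = 4  # four-byte character (GB18030 only)
            else:
                step = 1  # ASCII or an unrecognized byte
            # Stop before a character that would pass the limit, so the
            # result is never longer than s_len.
            if len_num + step > s_len:
                break
            len_num += step
        return string[0:len_num]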
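As a cross-check on hand-written byte ranges like the ones above, the standard library's incremental decoders can find the boundary instead. This is a sketch of a swapped-in technique, not the original author's method: the function name `truncate_encoded` is made up, and it assumes the input is valid in the given encoding, so that re-encoding the decoded text reproduces the original bytes.

```python
import codecs


def truncate_encoded(data: bytes, limit: int, encoding: str) -> bytes:
    """Cut encoded bytes to at most `limit` bytes on a character boundary."""
    if limit <= 0:
        return b""
    if limit >= len(data):
        return data
    decoder = codecs.getincrementaldecoder(encoding)()
    # With final=False the decoder returns only complete characters and
    # holds back an incomplete trailing sequence instead of raising.
    text = decoder.decode(data[:limit], final=False)
    return text.encode(encoding)


gb = "中文abc".encode("gb18030")
print(truncate_encoded(gb, 3, "gb18030").decode("gb18030"))  # 中
```

The trade-off is an extra decode and encode of the kept prefix, in exchange for not maintaining the lead-byte and trail-byte ranges by hand; the same function also works for UTF-8 by passing `"utf-8"` as the encoding.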
Original: http://www.cnblogs.com/liyuxia713/p/3518689.html