Step 1:
Based on the program.list file under the conf directory, create a folder named after each program under raw_data.
Based on the program_keywords file under the conf directory, create each program's filter-word file inside that program's folder.
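For reference, a hypothetical sketch of the two conf files in the layout the script below expects (program names and keywords are invented for illustration; <TAB> marks a literal tab). program.list holds one program name per line; each line of program_keywords starts with a program name followed by its keywords, tab-separated, so the script's awk loop emits the name itself as the first filter word:

    conf/program.list
        programA
        programB

    conf/program_keywords
        programA<TAB>keywordA1<TAB>keywordA2
        programB<TAB>keywordB1<TAB>keywordB2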
Step 2:
Filter sina_weibo.data against each program's filter words, grepping for each of the program's keywords in turn.
The matches are written to a corresponding program.data file. The extracted fields (awk column numbers from the tab-separated source file) are:
weibo id ($2), user id ($3), creation time ($5), reposts ($11), comments ($12), likes ($13), content ($6)
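A hypothetical line of program.data under this layout (tab-separated; every value here is invented for illustration):

    3512345678901234567<TAB>1234567890<TAB>2014-03-01 20:05:11<TAB>12<TAB>3<TAB>45<TAB>tonight's episode was great...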
The complete script for both steps:
    #!/bin/sh
    root_dir=/home/minelab/liweibo
    source_file=/home/minelab/cctv2014/data_warehouse/sina_weibo.data
    conf_dir=$root_dir/conf
    raw_dir=$root_dir/raw_data

    # Step 1: create a folder per program under raw_data and build each
    # program's keyword file (commented out after the first run).
    #echo "make the program dir..."
    #while read line
    #do
    #    rm -rf $raw_dir/$line
    #    mkdir $raw_dir/$line
    #    cat $conf_dir/program_keywords | grep "$line" | awk -F'\t' '{for(i=1;i<=NF;i++) print $i}' > $raw_dir/$line/$line.filterwords
    #    echo $line" mkdir and get filter words is done!"
    #done < $conf_dir/program.list

    echo 'get the candidate tweet for each program filtering by the keywords...'
    program_list=`ls $raw_dir`
    tab=$(printf '\t')    # literal tab, used as the sort field separator
    for program in $program_list
    do
        rm -rf $raw_dir/$program/$program.data
        rm -rf $raw_dir/$program/$program.uniq
        # Step 2: grep the source file for each filter word and keep the
        # selected columns, tab-separated.
        while read line
        do
            cat $source_file | grep "$line" | awk -F'\t' '{print $2"\t"$3"\t"$5"\t"$11"\t"$12"\t"$13"\t"$6}' >> $raw_dir/$program/$program.data
        done < $raw_dir/$program/$program.filterwords
        echo $program "filtering is done!"
        # Strip short t.cn URLs, then deduplicate on the content column.
        sed -i '1,$s/http:\/\/t\.cn\/[a-zA-Z0-9]\{4,9\}//g' $raw_dir/$program/$program.data
        echo $program "remove url is done..."
        cat $raw_dir/$program/$program.data | sort -t "$tab" -k 7 | uniq -f 6 > $raw_dir/$program/$program.uniq
        echo $program "uniq is done ..."
        wc -l $raw_dir/$program/$program.uniq > $raw_dir/$program/$program.statistic
        echo $program "statistic is done..."
    done
    echo "preData is done..."
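One caveat on the dedupe step: uniq -f 6 skips the first six whitespace-delimited fields, which only lines up with the seventh tab-separated column if none of the first six fields contain spaces; a creation-time value like "2014-03-01 20:05:11" may break that assumption. A minimal alternative sketch (not from the original post) that keys on the content column directly:

    # Keep the first occurrence of each distinct content string ($7);
    # no sort needed, and embedded spaces in earlier fields are harmless.
    awk -F'\t' '!seen[$7]++' $raw_dir/$program/$program.data > $raw_dir/$program/$program.uniq

Unlike sort | uniq, this preserves the original line order of the kept records.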
Original post: http://www.cnblogs.com/bobodeboke/p/3575775.html