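There are plenty of source-code walkthroughs of Nutch online, but the crawling part is still hard to follow. Today I finally worked out how to set it up. I'm posting my crawl-urlfilter.txt here so we can discuss it, and as a note to myself.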
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The url filter file used by the crawl command.
#
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
#
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
#
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
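# accept URLs containing certain characters (the stock file skips them
# as probable queries, etc.)
# This matters a lot when crawling dynamic sites and has to be set this
# way. Otherwise pages with a question mark in the URL, such as
# a.jsp?a=001, cannot be crawled. Because the first matching pattern
# wins, this line accepts every URL containing one of these characters
# that got past the rules above, whatever its host.
+[?*!@=]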
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
###########################7shop24########################################
#+^http://([a-z0-9]*\.)*7shop24.com/
#+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$
###############################http://www.redbaby.com.cn/##############################
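# Crawling follows a path, so the rules can't be written arbitrarily.
# For example, to crawl product pages you first have to let the home
# page in. Products are linked from the category pages, so those have
# to be included as well. Only then do you add the regexes for the
# product pages themselves. Every page on the way from the home page
# to a product has to pass the filter, or the crawl never gets there.
# For example: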
+^http://www\.redbaby\.com\.cn/$
+^http://www\.redbaby\.com\.cn/([a-zA-Z]*\.)*index\.html$
+^http://www\.redbaby\.com\.cn/([a-zA-Z]*)/$
+^http://www\.redbaby\.com\.cn/([a-zA-Z]*)/index\.html$
+^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
+^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
+^http://www\.redbaby\.com\.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
+^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
+^http://www\.redbaby\.com\.cn/Product/ProductInfo\w\d\w\d\.htm$
# skip everything else
-.
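
To see how the "first matching pattern wins" rule plays out, here is a minimal Java sketch of such a filter. It is not Nutch's own code: the class name `UrlFilterSketch`, its methods, and the test URLs are made up for illustration, the rule list is a shortened subset of the file above, and it assumes unanchored matching via `Matcher.find()`, which is what rules like `-.` and `+[?*!@=]` appear to rely on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    /** A shortened subset of the rules above, kept in file order. */
    public static final String[] SAMPLE_RULES = {
        "# skip file:, ftp:, & mailto: urls",
        "-^(file|ftp|mailto):",
        "-\\.(gif|jpg|png|css|zip)$",
        "+[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+^http://www\\.redbaby\\.com\\.cn/$",
        "+^http://www\\.redbaby\\.com\\.cn/([a-zA-Z]*)/$",
        "-."
    };

    private final List<String> lines = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    public UrlFilterSketch(String... ruleLines) {
        for (String line : ruleLines) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue; // blank lines and comments carry no rule
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            lines.add(line);
            patterns.add(Pattern.compile(line.substring(1)));
        }
    }

    /** Returns the first rule line whose regex is found in the URL, or null if none is. */
    public String decidingRule(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return lines.get(i);
            }
        }
        return null;
    }

    /** A URL is kept only if some rule matches and the first such rule starts with '+'. */
    public boolean accepts(String url) {
        String rule = decidingRule(url);
        return rule != null && rule.charAt(0) == '+';
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(SAMPLE_RULES);
        String[] urls = {
            "http://www.redbaby.com.cn/",
            "http://www.redbaby.com.cn/toys/",          // hypothetical category page
            "http://other.example.com/a.jsp?a=001",      // hypothetical foreign host
            "http://www.redbaby.com.cn/logo.gif",
            "http://www.redbaby.com.cn/about.html",
            "http://www.redbaby.com.cn/a/b/a/c/a/"       // repeating segment
        };
        for (String url : urls) {
            System.out.println((filter.accepts(url) ? "keep " : "drop ")
                    + url + "  <- " + filter.decidingRule(url));
        }
    }
}
```

Running `main` prints which rule decided each URL. Note that the foreign-host URL with a query string is kept: `+[?*!@=]` sits above the host rules, so it decides before they are ever consulted.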
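Java regex escapes you may need when matching URLs:
? is written \?
_ (underscore) can be matched with \w (which also matches letters and digits; a plain _ needs no escaping)
. (dot) is written \.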
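
A small Java check of these escapes, as a sketch only: the helper `found` and the sample URLs (including the `ProductInfo_1_2.htm` shape) are assumptions made for illustration, and matching is done with `Matcher.find()`.

```java
import java.util.regex.Pattern;

public class EscapeSketch {
    /** True if the regex is found anywhere in the input. */
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // An unescaped '.' matches any character, so a look-alike host gets through.
        System.out.println(found("^http://www.redbaby.com.cn/$", "http://wwwXredbaby.com.cn/"));
        System.out.println(found("^http://www\\.redbaby\\.com\\.cn/$", "http://wwwXredbaby.com.cn/"));

        // An unescaped '?' makes the preceding 'p' optional instead of matching a literal '?'.
        System.out.println(found("a\\.jsp?a=001", "a.jsp?a=001"));
        System.out.println(found("a\\.jsp\\?a=001", "a.jsp?a=001"));

        // \w matches '_' but also letters and digits; a literal '_' is stricter.
        System.out.println(found("ProductInfo\\w\\d\\w\\d\\.htm$", "ProductInfoA1B2.htm"));
        System.out.println(found("ProductInfo_\\d_\\d\\.htm$", "ProductInfoA1B2.htm"));
    }
}
```

In a Java string literal each backslash is doubled (`"\\."`), while in the filter file the regex is written with a single backslash (`\.`).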
Original: http://www.cnblogs.com/lixiuran/p/3682095.html