由于工作原因,打算对larbin的源码进行分析一番
用的事2.6.3版本的larbin源码,由于这是业余,会断断续续的分析上传,已做记录笔记
今天我们分析一下larbin的哈希表
这个哈希表结构比较简单,因为它的主要用处是排重,因此只给出了用于排重的简单函数,
我们来看一下头文件怎么定义的:
// Larbin // Sebastien Ailleret // 23-11-99 -> 14-01-00 /* class hashTable * This class is in charge of making sure we don‘t crawl twice the same url */ #ifndef HASHTABLE_H #define HASHTABLE_H #include "types.h" #include "utils/url.h" class hashTable { private: ssize_t size; char *table; public: /* constructor */ hashTable (bool create); /* destructor */ ~hashTable (); /* save the hashTable in a file */ void save(); /* test if this url is allready in the hashtable * return true if it has been added * return false if it has allready been seen */ bool test (url *U); /* set a url as present in the hashtable */ void set (url *U); /* add a new url in the hashtable * return true if it has been added * return false if it has allready been seen */ bool testSet (url *U); }; #endif // HASHTABLE_H
由头文件我们可以看出,这个哈希表仅仅有四个成员函数(除了构造和析构)
save 函数是用于保存哈希表内部的数据,用于防止程序异常退出而造成数据丢失,因此把哈希内数据保存到一个文件中
test 函数用于测试参数指定的URL是否在哈希表内存在,只要是排重
set 函数就是判断出需要设置哈希表内值得时候设置该位置的URL对应的值,表示该URL从此开始存在于哈希表中
testset 是一个辅助函数,先判断,然后设置该位置的值,并且返回设置前的判断结果
下面我们就仔细来看一看各个函数的实现,比较简单,我就在程序中做了简单注释,就不再多余的文字解释了
构造函数:
hashTable::hashTable (bool create) { //构造函数,申请哈希需求的空间并初始化 ssize_t total = hashSize/8; //因为是位集合判断,所以每个字节8位,对于哈希的总成都除以8 table = new char[total]; //申请哈希空间,其实这个地方主要是以数组巧妙勾勒哈希功能 if (create) { //是一个标志,也就是说哈希内部的数据是从文件内读取还是初始化位0 for (ssize_t i=0; i<hashSize/8; i++) { table[i] = 0; } } else { int fds = open("hashtable.bak", O_RDONLY); //从bak备份文件读取数据 if (fds < 0) { cerr << "Cannot find hashtable.bak, restart from scratch\n"; for (ssize_t i=0; i<hashSize/8; i++) { //如果打开备份文件失败,就重新赋值位0,当做第一次看待 table[i] = 0; } } else { ssize_t sr = 0; while (sr < total) { ssize_t tmp = read(fds, table+sr, total-sr); //然后循环读取文件,直到成功读取所有数据 if (tmp <= 0) { cerr << "Cannot read hashtable.bak : " << strerror(errno) << endl; exit(1); } else { sr += tmp; //增加每次读取的数据 } } close(fds); //关闭文件描述符 } } }
析构函数:
hashTable::~hashTable () { //析构函数,释放哈希申请的空间 delete [] table; }
测试函数test:
bool hashTable::test (url *U) { //判断该url对应的是否存在哈希中,如果存在返回true,否则false unsigned int code = U->hashCode(); //根据hashCode函数求散列值 unsigned int pos = code / 8; unsigned int bits = 1 << (code % 8); return table[pos] & bits; }
设置函数:
void hashTable::set (url *U) { //设置url对应哈希值 unsigned int code = U->hashCode(); unsigned int pos = code / 8; unsigned int bits = 1 << (code % 8); table[pos] |= bits; }
测试设置函数:
bool hashTable::testSet (url *U) { //返回测试结果,并且设置url对应的值 unsigned int code = U->hashCode(); unsigned int pos = code / 8; unsigned int bits = 1 << (code % 8); int res = table[pos] & bits; table[pos] |= bits; return !res; }
保存文件函数:
void hashTable::save() { //把哈希内部数据存到文件,该过程是循序渐进的,防止时间间隔过程造成数据大量失真 rename("hashtable.bak", "hashtable.old"); //先把先前备份文件保存,直到最后本次成功备份后删除 int fds = creat("hashtable.bak", 00600); if (fds >= 0) { ecrireBuff(fds, table, hashSize/8); //辅助函数,把哈希数据存到备份文件 close(fds); } unlink("hashtable.old"); }
该哈希的处理部分就这么多,下面重点来看看我们两个知识点
1,散列函数 hashCode:
uint url::hashCode () { unsigned int h=port; unsigned int i=0; while (host[i] != 0) { h = 31*h + host[i]; i++; } i=0; while (file[i] != 0) { h = 31*h + file[i]; i++; } return h % hashSize; }
说起来散列函数,要求很有艺术的,而且散列函数也不可能有百分百的通用性。
一般都要自己根据哈希设置自己的散列函数。最起码要设置某些数值,用同一个散列方法和框架
该散列函数比较简单,就是把host和file(URL类中的两个字段,表示主机和文件路径)依次乘以31
然后对哈希最大值求余数,最大值这样定义的:
#define hashSize 64000000
另外对于host和file的先后哈希顺序也是设计的,先host而后file是为了让同一host对应的file的差异更大,减缓相似冲突
2,下面我们就来谈谈URL这个类,上面那个哈希散列函数就是这个类中的一个成员函数,之所以单独摘出去说,是因为散列函数也是重大的一块
我们先来看一下URL类的定义:
class url { private: char *host; char *file; uint16_t port; // the order of variables is important for physical size int8_t depth; /* parse the url */ void parse (char *s); /** parse a file with base */ void parseWithBase (char *u, url *base); /* normalize file name */ bool normalize (char *file); /* Does this url starts with a protocol name */ bool isProtocol (char *s); /* constructor used by giveBase */ url (char *host, uint port, char *file); public: /* Constructor : Parses an url (u is deleted) */ url (char *u, int8_t depth, url *base); /* constructor used by input */ url (char *line, int8_t depth); /* Constructor : read the url from a file (cf serialize) */ url (char *line); /* Destructor */ ~url (); /* inet addr (once calculated) */ struct in_addr addr; /* Is it a valid url ? */ bool isValid (); /* print an URL */ void print (); /* return the host */ inline char *getHost () { return host; } /* return the port */ inline uint getPort () { return port; } /* return the file */ inline char *getFile () { return file; } /** Depth in the Site */ inline int8_t getDepth () { return depth; } /* Set depth to max if we are at an entry point in the site * try to find the ip addr * answer false if forbidden by robots.txt, true otherwise */ bool initOK (url *from); /** return the base of the url * give means that you have to delete the string yourself */ url *giveBase (); /** return a char * representation of the url * give means that you have to delete the string yourself */ char *giveUrl (); /** write the url in a buffer * buf must be at least of size maxUrlSize * returns the size of what has been written (not including ‘\0‘) */ int writeUrl (char *buf); /* serialize the url for the Persistent Fifo */ char *serialize (); /* very thread unsafe serialisation in a static buffer */ char *getUrl(); /* return a hashcode for the host of this url */ uint hostHashCode (); /* return a hashcode for this url */ uint hashCode (); #ifdef URL_TAGS /* tag associated to this url */ uint tag; #endif // URL_TAGS #ifdef COOKIES /* cookies associated with this page */ char *cookie; void addCookie(char *header); #else // COOKIES inline void addCookie(char *header) {} #endif // COOKIES };
稍后待续。。。
原文:http://www.cnblogs.com/lfsblack/p/3891573.html