最短摘要生成

时间：2014-05-22 22:36:59 阅读：610 评论：0 收藏：0 [点我收藏+]

题目：

Alibaba笔试题：给定一段产品的英文描述，包含M个英文字母，每个英文单词以空格分隔，无其他标点符号；
再给定N个英文单词关键字，请说明思路并编程实现方法String extractSummary(String description,String[] key words)，
目标是找出此产品描述中包含N个关键字（每个关键词至少出现一次）的长度最短的子串，作为产品简介输出。（不限编程语言）20分。

在网上看到一种用链表实现的方法，时间复杂度可以达到O（m+n）。在这里实现一下。

分析：

1，将传入的key words[]生成哈希表，便于以后的字符串比较。结构为KeyHash，如下：

 struct KeyHash 
{ 
　　int cnt; 
　　char key[]; 
　　int hash;
 }

结构体中的hash代表了关键字的哈希值，key代表了关键字，cnt代表了在当前的扫描过程中，扫描到的该关键字的个数。当然，作为哈希表结构，该结构体中还会有其它值，这里不赘述。初始状态下，所有哈希结构的cnt字段为0。
2，建立一个KeyWord结构，结构体如下：

struct KeyWord
{
　　int start;
　　KeyHash* key; 
　　KeyWord* next; 
　　KeyWord* prev; 
}

key 字段指向了建立的一个KeyWord 代表了当前扫描到的一个关键字，扫描到的多个关键字组成一个双向链表。 start 字段指向了关键字在文章中的起始位置。

3，建立几个全局变量: KeyWord* head，指向了双向链表的头，初始为NULL。 KeyWord* tail，指向了双向链表的尾，初始为NULL。 int minLen，当前扫描到的最短的摘要的长度，初始为0。 int minStartPos，当前扫描到的最短摘要的起始位置。 int needKeyCnt，还需要几个关键字才能够包括全部的关键字，初始为关键字的个数。

4，开始对文章进行扫描。每扫描到一个关键字时，就建立一个KeyWord 的结构并且将其连入到扫描到的双向链表中，更新head 和tail 结构，同时将对应的KeyHash 结构中的cnt 加1，表示扫描到了关键字。如果cnt 由0 变成了1，表示扫描到一个新的关键字，因此needKeyCnt 减1。

5，当needKeyCnt 变成0 时，表示扫描到了全部的关键字了。此时要进行一个操作：链表头优化。链表头指向的word 是摘要的起始点，可是如果对应的KeyHash 结构中的cnt 大于1，表示扫描到的摘要中还有该关键字，因此可以跳过该关键字。因此，此时将链表头更新为下一个关键字，同时，将对应的KeyHash 中的结构中的cnt 减1，重复这样的检查，直至某个链表头对应的KeyHash 结构中的cnt 为1，此时该结构不能够少了。

6，如果找到更短的minLength，则更新minLength 和minStartPos。

7，开始新一轮的搜索。此时摘除链表的第一个节点，将needKeyCnt 加1，将下一个节点作为链表头，同样的开始链表头优化措施。搜索从上一次的搜索结束处开始，不用回溯。就是所，搜索在整个算法的过程中是一直沿着文章向下的，不会回溯。，直至文章搜索完毕。

这样的算法的复杂度初步估计是O(M+N)。

代码：

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#define SHIFT 2
#define SIZE 256

typedef struct
{
	int cnt;
	char *key;
	int hash;
}KeyHash;

typedef struct Keyword
{
	int start;
	KeyHash *keyword;
	struct Keyword *next;
	struct Keyword *prev;
}KeyWord;

KeyWord *head=NULL;
KeyWord *tail=NULL;
int minLen=100;
int minStartPos=0;
int needKeyCnt=2;

int Hash ( char * str )
{
	int temp = 0;
	int i = 0;

	while (str[i] !='\0')
	{
		temp = ((temp << SHIFT) + str[i]) % SIZE+1;//因为cnt初始化为0，所以要避开0。
		++i ;
	}
return temp;
}//构造哈希函数

void DoHashing(char **initArr,KeyHash *hashArr)
{
	int length=2;
	int i=0;
	int hashVal=0;

	for(i=0;i<length;i++)
	{
		hashVal=Hash(initArr[i]);
		while(hashVal==hashArr[hashVal].hash)
			hashVal+=1;  //处理冲突
		hashArr[hashVal].cnt=0;
		hashArr[hashVal].key=initArr[i];
		hashArr[hashVal].hash=hashVal;
	}
}

void ListHeadOptimize()
{
	KeyWord *tmp=NULL;

	while(head->keyword->cnt!=1)
	{
		(head->keyword->cnt)--;
		tmp=head;
		head=head->next;
		free(tmp);
	}
}//表头优化

int main()
{
	KeyHash hashArr[SIZE+1];
	char *initKeyArr[2]={"q0","q1"};
	char *initSummary[14]={"w0","w1","w2","w3","q0","w4","w5","q1","w6"
										,"w7","w8","q1","w9","q0"};  //注意在换行时逗号也要换下一行
	int hashVal=0;
	int i;
	KeyWord *pnew=NULL,*tmp=NULL;

	memset(hashArr,0,sizeof(KeyHash)*(SIZE+1));
	DoHashing(initKeyArr,hashArr);
	for(i=0;i<14;i++)
	{
		hashVal=Hash(initSummary[i]);
		if(hashVal==hashArr[hashVal].hash)
		{
			pnew=(KeyWord*)malloc(sizeof(KeyWord));
			pnew->start=i;
			pnew->keyword=&hashArr[hashVal];
			if(tail==NULL || head==NULL)
			{
				head=tail=pnew;
				pnew->prev=NULL;
				pnew->next=NULL;
			}
			else
			{
				tail->next=pnew;
				pnew->prev=tail;
				tail=pnew;
				pnew->next=NULL;
			}

			(hashArr[hashVal].cnt)++;
			if(hashArr[hashVal].cnt==1)
				needKeyCnt--;/*注意这里全局变量needKeyCnt减1是有条件的:
				只有在新加入的节点关键字在链表前面没出现过(即cnt由原来的0变成1)才能将needKeyCnt减1*/

			if(needKeyCnt==0)
			{
				ListHeadOptimize();
				if(tail->start - head->start < minLen)
				{
					minLen=tail->start - head->start;
					minStartPos=head->start;
				}
				(head->keyword->cnt)--;//这里也同样删掉了头节点所以也要减1
				tmp=head;
				head=head->next;
				free(tmp);
				ListHeadOptimize();
				needKeyCnt++;
			}
		}
	}

	for(i=minStartPos;i<=minStartPos+minLen;i++)
		printf("%s ",initSummary[i]);

	return 0;
}

最短摘要生成,布布扣,bubuko.com

最短摘要生成

原文：http://blog.csdn.net/wsx199397/article/details/26496705

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)