Parsing HTML Data on Windows Phone 7

Posted: 2014-03-05 18:01:33


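In my previous article I covered GB2312 decoding on Windows Phone 7:

http://www.cnblogs.com/qingci/archive/2011/11/25/2263124.html

That solved the problem of downloaded HTML showing up as garbled text. In this article I will show how to parse the HTML on Windows Phone 7 so that we can pull out just the data we want.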

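First, let me introduce a library: HtmlAgilityPack (the previous article also used it for the decoding). I will ship the library's DLL together with the demo.

I will use a Sina News page as the example to parse.

Start with the web version of the Sina News article:

http://news.sina.com.cn/w/sd/2011-11-27/070023531646.shtml

Then look at its page source.

The news content turns out to be structured like this: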

<div class="blkContainerSblk">
    <h1 id="artibodyTitle" pid="1" tid="1" did="23531646" fid="1666">title</h1>
    <div class="artInfo">
        <span id="art_source"><a href="http://www.sina.com.cn">http://www.sina.com.cn</a></span>
        <span id="pub_date">pub_date</span>
        <span id="media_name"><a href="">media_name</a> <a href=""></a> </span>
    </div>

    <!-- 正文内容 begin -->
    <!-- google_ad_section_start -->

    <div class="blkContainerSblkCon" id="artibody"></div>
</div>

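Most of these elements have an ID attribute, which makes them much easier to parse.

Now let's start parsing.

Step 1: Add a reference to HtmlAgilityPack.dll.

Step 2: Use the WebClient or WebRequest class to download the HTML page and turn it into a string.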

// Requires: System, System.IO, System.Net, System.Threading, System.Windows
public delegate void CallbackEvent(object sender, DownloadEventArgs e);
public event CallbackEvent DownloadCallbackEvent;

public void HttpWebRequestDownloadGet(string url)
{
    // Start the request from a background thread so the UI thread is never blocked.
    Thread _thread = new Thread(delegate()
    {
        Uri _uri = new Uri(url, UriKind.RelativeOrAbsolute);
        HttpWebRequest _httpWebRequest = (HttpWebRequest)WebRequest.Create(_uri);
        _httpWebRequest.Method = "GET";

        _httpWebRequest.BeginGetResponse(new AsyncCallback(delegate(IAsyncResult result)
        {
            HttpWebRequest _httpWebRequestCallback = (HttpWebRequest)result.AsyncState;
            HttpWebResponse _httpWebResponseCallback = (HttpWebResponse)_httpWebRequestCallback.EndGetResponse(result);
            Stream _streamCallback = _httpWebResponseCallback.GetResponseStream();

            // Decode the response body as GB2312 (see the previous article).
            StreamReader _streamReader = new StreamReader(_streamCallback, new HtmlAgilityPack.Gb2312Encoding());
            string _stringCallback = _streamReader.ReadToEnd();

            // Raise the event on the UI thread so that handlers can update controls directly.
            Deployment.Current.Dispatcher.BeginInvoke(new Action(() =>
            {
                if (DownloadCallbackEvent != null)
                {
                    DownloadEventArgs _downloadEventArgs = new DownloadEventArgs();
                    _downloadEventArgs._DownloadStream = _streamCallback; // note: already read to the end above
                    _downloadEventArgs._DownloadString = _stringCallback;
                    DownloadCallbackEvent(this, _downloadEventArgs);
                }
            }));
        }), _httpWebRequest);
    });
    _thread.Start();
}

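O(∩_∩)O! My version is on the complicated side. The only thing that matters is that we end up with the downloaded HTML.

Here is a simpler way to download it: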

WebClient webClient = new WebClient();

webClient.Encoding = new HtmlAgilityPack.Gb2312Encoding(); // set the encoding so GB2312 pages decode correctly

// Subscribe to the completion event before starting the download.
webClient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(webClient_DownloadStringCompleted);

webClient.DownloadStringAsync(new Uri("http://news.sina.com.cn/s/2011-11-25/120923524756.shtml", UriKind.RelativeOrAbsolute));
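The completion handler itself is not shown above, so here is a minimal sketch of what it could look like. The class name `NewsDownloader`, the `HtmlReady` event and the method names are hypothetical, invented for this illustration; `Gb2312Encoding` is the decoder class from the previous article. The sketch checks `e.Error` and `e.Cancelled` before reading `e.Result`, because reading `e.Result` after a failed request throws.

```csharp
using System;
using System.Net;

// Hypothetical wrapper class, for illustration only.
public class NewsDownloader
{
    // Raised with the decoded HTML, or with null when the download failed or was cancelled.
    public event Action<string> HtmlReady;

    public void Download(string url)
    {
        WebClient client = new WebClient();
        client.Encoding = new HtmlAgilityPack.Gb2312Encoding(); // decoder from the previous article
        client.DownloadStringCompleted += OnDownloadStringCompleted; // subscribe first
        client.DownloadStringAsync(new Uri(url, UriKind.Absolute));  // then start the download
    }

    private void OnDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // Only touch e.Result when the request actually succeeded.
        string html = (e.Error == null && !e.Cancelled) ? e.Result : null;

        Action<string> handler = HtmlReady;
        if (handler != null)
        {
            handler(html);
        }
    }
}
```

A page could then subscribe to `HtmlReady` and pass the string on to the parsing code, skipping the parse when it receives null.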

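Now process the downloaded string in the callback (e.Result when using WebClient, e._DownloadString when using the custom event above):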

string _result = e._DownloadString;

HtmlDocument _doc = new HtmlDocument(); // create the HtmlAgilityPack.HtmlDocument object
_doc.LoadHtml(_result);                 // load the HTML

HtmlNode _htmlNode01 = _doc.GetElementbyId("artibodyTitle"); // the <h1> holding the news title
string _title = _htmlNode01.InnerText;

HtmlNode _htmlNode02 = _doc.GetElementbyId("artibody");      // the <div> holding the article body
string _content = _htmlNode02.InnerText;

// Cut off everything from the comment block's inline CSS onward.
// IndexOf returns -1 when the marker is missing, so only truncate when it was found.
int _divIndex = _content.IndexOf(" .blkComment");
if (_divIndex >= 0)
{
    _content = _content.Substring(0, _divIndex);
}

#region Sina link
HtmlNode _htmlNode03 = _doc.GetElementbyId("art_source");
string _www = _htmlNode03.FirstChild.InnerText;                 // link text
string _wwwInt = _htmlNode03.FirstChild.Attributes[0].Value;    // first attribute of the <a>, i.e. href
#endregion

#region Publication date
HtmlNode _htmlNode04 = _doc.GetElementbyId("pub_date");
string _pub_date = _htmlNode04.InnerText;
#endregion

#region Source site information
HtmlNode _htmlNode05 = _doc.GetElementbyId("media_name");
string _media_name = _htmlNode05.FirstChild.InnerText;
string _media_source = _htmlNode05.FirstChild.Attributes[0].Value; // href of the source link
#endregion

// Bind the parsed values to the page controls.
Media_nameHyperlinkButton.Content = _pub_date + " " + _media_name;
Media_nameHyperlinkButton.NavigateUri = new Uri(_media_source, UriKind.RelativeOrAbsolute);
TitleTextBlock.Text = _title;
ContentTextBlock.Text = _content;
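One thing to watch in the parsing code: `GetElementbyId` returns null when no element has the requested ID, so a page with a slightly different layout would crash on the first `.InnerText`. A small helper along these lines can make the lookups more forgiving. This is only a sketch; the class `HtmlNodeHelper` and its two methods are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class HtmlNodeHelper
{
    // Returns the trimmed inner text of the element with the given id,
    // or the fallback value when the page contains no such element.
    public static string GetTextById(HtmlDocument doc, string id, string fallback)
    {
        HtmlNode node = doc.GetElementbyId(id);
        return node != null ? node.InnerText.Trim() : fallback;
    }

    // Returns the part of text that precedes marker,
    // or the whole text when the marker does not occur.
    public static string TruncateAt(string text, string marker)
    {
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        return index >= 0 ? text.Substring(0, index) : text;
    }
}
```

With it, the title lookup becomes `HtmlNodeHelper.GetTextById(_doc, "artibodyTitle", "")`, and the body can be cleaned up with `HtmlNodeHelper.TruncateAt(_content, " .blkComment")`.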

The result is shown in the screenshot below:

[screenshot]

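Most tags on a typical web page have no ID attribute. Fortunately, HtmlAgilityPack supports XPath.

In that case you use an XPath expression to find and match the nodes you need.

XPath tutorial: http://www.w3school.com.cn/xpath/index.asp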
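To make the XPath route concrete, here is a rough sketch that collects the link targets inside the `artInfo` block of the Sina page shown earlier. It assumes the HtmlAgilityPack build you reference exposes `SelectNodes` (some phone builds need extra XPath assemblies); the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical example class, for illustration only.
public static class XPathExample
{
    // Returns the href of every <a> found inside <div class="artInfo">.
    public static List<string> GetArtInfoLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//div[@class='artInfo']//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                hrefs.Add(link.GetAttributeValue("href", string.Empty));
            }
        }
        return hrefs;
    }
}
```

The same pattern works for elements that have a class but no ID, for example `//div[@class='blkContainerSblkCon']` for the article body.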

Demo download:

http://115.com/file/dn87dl2d#
MyFramework_Test.zip

Original: http://www.cnblogs.com/lonelyxmas/p/3581473.html
