
Projects_Tweets Analysis

Posted: 2016-02-05 01:32:28

This is a course project on real-time big data analysis. I collected about 120,000 tweets (roughly 4 GB) and analyzed them with Hadoop map/reduce to extract the top URLs and hashtags on Twitter. The tweets were collected through the Twitter Streaming API in JSON format, and the data were loaded into HDFS on the NYU HPC clusters to take advantage of their fast computing capability.

In the map task, each tweet is parsed with displayURL, hashtagEntities, and text as keys and their corresponding contents as values. The mapper marks each occurrence of an entity and sends it to the reducer, and the reducer produces the total count of each displayURL or hashtag entity. A second map/reduce pass then orders the entities, which yields the most-visited links and the top hashtags during that time. The total map/reduce computation over the 4 GB of data takes about 18 seconds, including the JVM launch time for each map task.
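The counting pass and the ordering pass described above can be sketched as Hadoop-streaming-style functions. This is a minimal in-memory illustration, not the actual HPC job: the function names, the key prefixes, and the flat JSON field layout are assumptions for the sketch.

```python
import json
from collections import defaultdict

def mapper(tweet_json):
    """First map task: emit (entity, 1) for the URL and each hashtag of one tweet."""
    tweet = json.loads(tweet_json)
    if tweet.get("displayURL"):
        yield ("displayURL:" + tweet["displayURL"], 1)
    for tag in tweet.get("hashtagEntities", []):
        yield ("hashtag:" + tag, 1)

def reducer(pairs):
    """First reduce task: sum the occurrence marks into a total per entity."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def top_entities(totals, n=10):
    """Second pass: order entities by total count to get the most popular ones."""
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]
```

In the real job the reducer sees its pairs grouped by key, and the ordering step runs as a separate map/reduce; here both are collapsed into plain functions to show the data flow.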

For a political popularity analysis of the presidential election, I use text matching to count how many tweets mention each politician. This gives a general sense of how popular they are on social media.
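The text-matching step might look like the following; the case-insensitive substring match, the function name, and the sample names in the test are illustrative assumptions rather than the project's exact method:

```python
def count_mentions(texts, names):
    """Count how many tweet texts mention each name (case-insensitive substring match)."""
    counts = {name: 0 for name in names}
    for text in texts:
        lowered = text.lower()
        for name in names:
            if name.lower() in lowered:
                counts[name] += 1
    return counts
```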

 

Obstacle: advertisements.

HashMap = {
    "displayURL": "some_url_here",
    "hashtagEntities": "#tag1, #tag2, …",
    "text": "something_typed_here"
}
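Extracting that key/value layout from one raw tweet can be sketched as below; the flat field names follow the map above, while a real Streaming API payload nests these fields differently, so treat this as an assumed simplified schema:

```python
import json

def extract_fields(raw_tweet):
    """Pull the three fields used as map keys out of one JSON tweet."""
    tweet = json.loads(raw_tweet)
    return {
        "displayURL": tweet.get("displayURL", ""),
        "hashtagEntities": tweet.get("hashtagEntities", []),
        "text": tweet.get("text", ""),
    }
```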


Original: http://www.cnblogs.com/touchdown/p/5182455.html
