This is a course project for real-time big data analysis. I collected about 120,000 tweets (roughly 4 GB) with the Twitter Streaming API in JSON format and analyzed them with Hadoop MapReduce to extract the most popular URLs and hashtags on Twitter. The data are loaded into HDFS on the NYU HPC clusters to take advantage of their computing power. In the map task, each tweet is parsed into its displayURL, hashtagEntities, and text fields; for every URL or hashtag entity the mapper marks one occurrence and sends it to the reducer. The reducer then sums these to get the total count of each displayURL or hashtagEntity. A second MapReduce job orders the entities by count, which yields the links most people viewed during that time, and likewise the top hashtags. The total MapReduce computation over the 4 GB of data takes about 18 seconds, including JVM launch time for each map task.
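A minimal sketch of the counting stage is shown below. It assumes each HDFS input line holds one raw Streaming API JSON object and that a JSON library such as org.json is on the classpath; the field paths used here (entities.urls[].expanded_url, entities.hashtags[].text) and the "URL:"/"TAG:" prefixes are illustrative assumptions, not the project's exact code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONArray;
import org.json.JSONObject;

// Mapper: one JSON tweet per input line; emit (entity, 1) for each URL and hashtag.
public class EntityCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text entity = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject tweet = new JSONObject(value.toString());
            JSONObject entities = tweet.optJSONObject("entities");
            if (entities == null) return;

            JSONArray urls = entities.optJSONArray("urls");          // shared links
            if (urls != null) {
                for (int i = 0; i < urls.length(); i++) {
                    entity.set("URL:" + urls.getJSONObject(i).optString("expanded_url"));
                    context.write(entity, ONE);
                }
            }
            JSONArray hashtags = entities.optJSONArray("hashtags");  // hashtags used
            if (hashtags != null) {
                for (int i = 0; i < hashtags.length(); i++) {
                    entity.set("TAG:#" + hashtags.getJSONObject(i).optString("text"));
                    context.write(entity, ONE);
                }
            }
        } catch (Exception e) {
            // Skip malformed JSON records rather than failing the map task.
        }
    }
}

// Reducer: sum the occurrences of each URL or hashtag.
class EntityCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        total.set(sum);
        context.write(key, total);
    }
}
```

One common way to implement the second, ordering pass is to have its mapper emit (count, entity) pairs so that Hadoop's shuffle sorts by count, leaving the top URLs and hashtags at the head of the output.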
For a political popularity analysis around the presidential election, I use text matching to count how many tweets mention each politician, which gives a general sense of how popular each one is on social media.
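A hedged sketch of this text-matching mapper follows; the candidate list here is a placeholder (in practice it could be passed in through the job Configuration), and the same summing reducer from the counting job can aggregate the per-politician totals.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

// Mapper: emit (candidate, 1) whenever the tweet text mentions that candidate's name.
public class MentionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Placeholder names; replace with the actual politicians being tracked.
    private static final String[] CANDIDATES = {"candidate_a", "candidate_b", "candidate_c"};
    private final Text candidate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String text = new JSONObject(value.toString()).optString("text").toLowerCase();
            for (String name : CANDIDATES) {
                if (text.contains(name)) {   // simple substring match on the tweet text
                    candidate.set(name);
                    context.write(candidate, ONE);
                }
            }
        } catch (Exception e) {
            // Ignore malformed records.
        }
    }
}
```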
Obstacle: advertisements.
HashMap = ["displayURL" : "some_url_here",
           "hashtagEntities" : "#tag1, #tag2, …",
           "text" : "something_typed_here"]
Original article: http://www.cnblogs.com/touchdown/p/5182455.html