This is a course project for real-time big data analysis. I collected about 120,000 tweets (roughly 4 GB) with the Twitter Streaming API in JSON format and analyzed them with Hadoop MapReduce to extract the most popular URLs and hashtags on Twitter. The data are loaded into HDFS on the NYU HPC clusters to take advantage of their computing power. In the map task, each tweet is parsed into its displayURL, hashtagEntities, and text fields; for every URL or hashtag entity the mapper marks one occurrence and sends it to the reducer. The reducer then sums these to get the total count of each displayURL or hashtagEntity. A second MapReduce job orders the entities by count, which yields the links most people viewed during that time, and likewise the top hashtags. The total MapReduce computation over the 4 GB of data takes about 18 seconds, including JVM launch time for each map task.
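A minimal sketch of the counting stage is shown below. It assumes each HDFS input line holds one raw Streaming API JSON object and that a JSON library such as org.json is on the classpath; the field paths used here (entities.urls[].expanded_url, entities.hashtags[].text) and the "URL:"/"TAG:" prefixes are illustrative assumptions, not the project's exact code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONArray;
import org.json.JSONObject;

// Mapper: one JSON tweet per input line; emit (entity, 1) for each URL and hashtag.
public class EntityCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text entity = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject tweet = new JSONObject(value.toString());
            JSONObject entities = tweet.optJSONObject("entities");
            if (entities == null) return;

            JSONArray urls = entities.optJSONArray("urls");          // shared links
            if (urls != null) {
                for (int i = 0; i < urls.length(); i++) {
                    entity.set("URL:" + urls.getJSONObject(i).optString("expanded_url"));
                    context.write(entity, ONE);
                }
            }
            JSONArray hashtags = entities.optJSONArray("hashtags");  // hashtags used
            if (hashtags != null) {
                for (int i = 0; i < hashtags.length(); i++) {
                    entity.set("TAG:#" + hashtags.getJSONObject(i).optString("text"));
                    context.write(entity, ONE);
                }
            }
        } catch (Exception e) {
            // Skip malformed JSON records rather than failing the map task.
        }
    }
}

// Reducer: sum the occurrences of each URL or hashtag.
class EntityCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        total.set(sum);
        context.write(key, total);
    }
}
```

One common way to implement the second, ordering pass is to have its mapper emit (count, entity) pairs so that Hadoop's shuffle sorts by count, leaving the top URLs and hashtags at the head of the output.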
For a political popularity analysis around the presidential election, I use text matching to count how many tweets mention each politician, which gives a general sense of how popular each one is on social media.
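A hedged sketch of this text-matching mapper follows; the candidate list here is a placeholder (in practice it could be passed in through the job Configuration), and the same summing reducer from the counting job can aggregate the per-politician totals.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

// Mapper: emit (candidate, 1) whenever the tweet text mentions that candidate's name.
public class MentionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Placeholder names; replace with the actual politicians being tracked.
    private static final String[] CANDIDATES = {"candidate_a", "candidate_b", "candidate_c"};
    private final Text candidate = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String text = new JSONObject(value.toString()).optString("text").toLowerCase();
            for (String name : CANDIDATES) {
                if (text.contains(name)) {   // simple substring match on the tweet text
                    candidate.set(name);
                    context.write(candidate, ONE);
                }
            }
        } catch (Exception e) {
            // Ignore malformed records.
        }
    }
}
```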
Obstacle: advertisements.
HashMap = ["displayURL" : "some_url_here",
           "hashtagEntities" : "#tag1, #tag2, …",
           "text" : "something_typed_here"]
Original article: http://www.cnblogs.com/touchdown/p/5182455.html