2015-04-30 33 views
2

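Twitter streaming delivers multiple tweets with the same ID

I am collecting tweets with the help of this pipeline. I tried to analyze the collected tweets with some scripts of my own, and I found that I am receiving multiple tweets with the same ID. I looked in hdfs://user/flume/tweets and saw that these duplicate tweets are already present in the stored files, so it is not a Hive or Oozie problem.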

Maybe this is a Flume problem, since I made some edits to the Flume parameters:

# GitHub example value: 1000
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
# GitHub example value: 10000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100000

TwitterAgent.channels.MemChannel.type = memory
# GitHub example value: 10000
TwitterAgent.channels.MemChannel.capacity = 100000
# GitHub example value: 100
TwitterAgent.channels.MemChannel.transactionCapacity = 10000

Or does Twitter itself deliver the tweets this way, so that it is not a Hadoop problem at all?

UPD 1

Here is my Flume configuration:

# The configuration file needs to define the sources, 
# the channels and the sinks. 
# Sources, channels and sinks are defined per agent, 
# in this case called 'TwitterAgent' 

TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS 

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource 
TwitterAgent.sources.Twitter.channels = MemChannel 
TwitterAgent.sources.Twitter.consumerKey = MyKey 
TwitterAgent.sources.Twitter.consumerSecret = MyKey 
TwitterAgent.sources.Twitter.accessToken = MyKey 
TwitterAgent.sources.Twitter.accessTokenSecret = MyKey 
TwitterAgent.sources.Twitter.keywords = hadoop, big-data , big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing 

TwitterAgent.sinks.HDFS.channel = MemChannel 
TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://rh-hadoop-master:8020/user/flume/tweets/%Y/%m/%d/%H/ 
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10000 
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100000 

TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 100000 
TwitterAgent.channels.MemChannel.transactionCapacity = 10000 

Here is an example of the duplicated lines:

{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":539321584226680833,"in_reply_to_user_id_str":null,"timestamp_ms":"1417419260447","in_reply_to_status_id":null,"created_at":"Mon Dec 01 07:34:20 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"Testing Engineer, Hyderabad/Secunderabad, 2 - 5 Year Exp,Software Test Engineer , &amp;#x22;Big Data&amp;#x22;... http://t.co/DAK1ilWhM5","contributors":null,"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"http://bit.ly/1ttBxPY","indices":[116,138],"display_url":"bit.ly/1ttBxPY","url":"http://t.co/DAK1ilWhM5"}],"hashtags":[{"text":"x22","indices":[89,93]},{"text":"x22","indices":[107,111]}],"user_mentions":[]},"source":"<a href=\"http://monsterindia.com\" rel=\"nofollow\">IT jobs, India<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"539321584226680833","user":{"location":"India","default_profile":false,"profile_background_tile":false,"statuses_count":63546,"lang":"en","profile_link_color":"0084B4","id":123537533,"following":null,"protected":false,"favourites_count":0,"profile_text_color":"333333","verified":false,"description":"Get latest job opportunities in Indian IT industry","contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"IT Jobs, India","profile_background_color":"C0DEED","created_at":"Tue Mar 16 11:48:44 +0000 
2010","default_profile_image":false,"followers_count":1245,"profile_image_url_https":"https://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":0,"profile_sidebar_fill_color":"DDEEF6","screen_name":"tech_career","id_str":"123537533","profile_image_url":"http://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","listed_count":43,"is_translator":false}} 
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":539321584226680833,"in_reply_to_user_id_str":null,"timestamp_ms":"1417419260447","in_reply_to_status_id":null,"created_at":"Mon Dec 01 07:34:20 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"Testing Engineer, Hyderabad/Secunderabad, 2 - 5 Year Exp,Software Test Engineer , &amp;#x22;Big Data&amp;#x22;... http://t.co/DAK1ilWhM5","contributors":null,"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"http://bit.ly/1ttBxPY","indices":[116,138],"display_url":"bit.ly/1ttBxPY","url":"http://t.co/DAK1ilWhM5"}],"hashtags":[{"text":"x22","indices":[89,93]},{"text":"x22","indices":[107,111]}],"user_mentions":[]},"source":"<a href=\"http://monsterindia.com\" rel=\"nofollow\">IT jobs, India<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"539321584226680833","user":{"location":"India","default_profile":false,"profile_background_tile":false,"statuses_count":63546,"lang":"en","profile_link_color":"0084B4","id":123537533,"following":null,"protected":false,"favourites_count":0,"profile_text_color":"333333","verified":false,"description":"Get latest job opportunities in Indian IT industry","contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"IT Jobs, India","profile_background_color":"C0DEED","created_at":"Tue Mar 16 11:48:44 +0000 
2010","default_profile_image":false,"followers_count":1245,"profile_image_url_https":"https://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/88067227/IT1.jpg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":0,"profile_sidebar_fill_color":"DDEEF6","screen_name":"tech_career","id_str":"123537533","profile_image_url":"http://pbs.twimg.com/profile_images/790482269/sm_it1_normal.jpg","listed_count":43,"is_translator":false}} 

Answers

1

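Flume does not add any ID to the data it is going to store. The same is true of HDFS: it does not add any identifier when storing data. They simply work together to move the generated data and store it.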

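If you are storing tweets with the same ID, it is because you are receiving the data with those IDs, or because you are interpreting the data incorrectly.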
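One way to tell these two cases apart is to group the stored lines by tweet ID and check whether the copies are byte-for-byte identical. The following is only a minimal sketch: it assumes each stored line is one complete JSON tweet carrying an `id_str` field (as in the sample in the question), and the function names `find_duplicate_ids` and `summarize` are made up for illustration.

```python
import json
from collections import defaultdict


def find_duplicate_ids(lines):
    """Group raw tweet lines by id_str and keep only IDs seen more than once."""
    by_id = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Skip partial records, e.g. a tweet split across two lines.
            continue
        if not isinstance(tweet, dict):
            continue
        tweet_id = tweet.get("id_str")  # assumption: the ID is stored as id_str
        if tweet_id is not None:
            by_id[tweet_id].append(line)
    return {tid: copies for tid, copies in by_id.items() if len(copies) > 1}


def summarize(duplicates):
    """For each duplicated ID, report the copy count and whether all copies match."""
    return {
        tid: {"copies": len(copies), "identical": len(set(copies)) == 1}
        for tid, copies in duplicates.items()
    }
```

For example, `summarize(find_duplicate_ids(open(path)))` could be run on a local copy of one stored file, where `path` is a placeholder. Identical copies suggest the same record was written more than once, while differing copies under one ID suggest the records really did arrive that way or are being parsed incorrectly.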

That said, perhaps you could edit your question to add some examples.

+0

In the stored files (written by Flume), I can see multiple identical lines of text. – UNIm95

+1

Could you add your complete Flume configuration to the question? Also, could you share the duplicated lines you mentioned in your comment? – frb

+0

Done. I added my Flume configuration and the duplicated lines. – UNIm95
