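Python map/lambda and ASCII error
This is my first text-mining project and my first time using pandas. I am trying to collect all the strings under the "text" key in a set of downloaded live tweets (JSON format) so that I can tokenize the tweets and count the most frequent words. Here is a sample tweet in JSON format: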
{
"contributors": null,
"truncated": false,
"text": "Hey Don : TheCougCoach :) Want to get iPh0ne 6 for FREE? Kindly check my bi0. Thx https://t.co/c38b8vqq2O",
"is_quote_status": true,
"in_reply_to_status_id": null,
"id": 659549062023262209,
"favorite_count": 0,
...... skip
},
"quoted_status_id": 659548944251228160,
"retweeted": false,
"coordinates": null,
"timestamp_ms": "1446083724872",
"quoted_status": {
"contributors": null,
"truncated": false,
"text": "I understand He is a criminal but Donald has all the right to be in the discussion. https://t.co/qv3oScGA1U",
"is_quote_status": true,
"in_reply_to_status_id": null,
Here is my code (Python 2.7 with pandas 0.17.0 or newer):
import json
import pandas as pd
tweets_data_path = 'tweet.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
tweets = pd.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
print tweets['text']
print tweets['text'].astype(str)  # Try to convert the pandas Series to strings so I can tokenize the tweets (the strings under "text" in the JSON) with a regular expression
Here is the output:
0 Hey Don : TheCougCoach :) Want to get iPh0ne 6...
1 I understand He is a criminal but Donald has a...
Name: text, dtype: object
UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)
Two questions:
(1) tweets = pd.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
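Here pandas and map/lambda give a simple way to pull out the data under "text" in the tweet JSON file. However, "map" seems to allow only a matching list length, which leaves the output incomplete (each row ends with ...). Is there a better way to code this?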
(2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)
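It looks like the input "tweet.txt" contains Unicode text, and that is why we hit this error? If so, should we encode "tweet.txt" while reading it? The actual input file is very large (several GB or even more), so is there a more efficient way to handle this? Thanks.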
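One possible direction for both questions, as a minimal sketch rather than a definitive fix. It assumes that "tweet.txt" is UTF-8 encoded with one JSON object per line, which the post does not confirm. The names iter_tweet_texts, count_words and TOKEN_RE are made up for illustration. The idea is to read the file lazily, keep every tweet as Unicode text instead of forcing it through str (under Python 2, astype(str) most likely triggers the implicit ASCII encode that fails), and count words as the lines stream by, so the whole file never has to sit in memory or in a DataFrame.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import json
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters, Unicode-aware.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def iter_tweet_texts(path):
    """Yield the "text" field of each tweet, reading one line at a time.

    Assumes a UTF-8 file with one JSON object per line.
    """
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except ValueError:
                # Skip lines that are not valid JSON.
                continue
            if isinstance(tweet, dict) and "text" in tweet:
                yield tweet["text"]


def count_words(texts):
    """Count lower-cased word tokens over an iterable of Unicode strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts
```

With these two helpers, count_words(iter_tweet_texts('tweet.txt')).most_common(20) would give the 20 most frequent tokens. Because iter_tweet_texts is a generator, memory use stays roughly proportional to the vocabulary size, not the file size. Separately, the "..." in the printed Series is only pandas shortening long strings for display; the stored values are not cut off.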