Python map/lambda and ASCII error

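This is my first text-mining project and my first time using pandas. I am trying to collect all the strings under the "text" key in a set of downloaded live tweets (JSON format), so that I can tokenize the tweets and count high-frequency words. Here is a sample tweet in JSON format: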

{ 
    "contributors": null, 
    "truncated": false, 
    "text": "Hey Don : TheCougCoach :) Want to get iPh0ne 6 for FREE? Kindly check my bi0. Thx https://t.co/c38b8vqq2O", 
    "is_quote_status": true, 
    "in_reply_to_status_id": null, 
    "id": 659549062023262209, 
    "favorite_count": 0, 
    ...... skip 
    }, 
    "quoted_status_id": 659548944251228160, 
     "retweeted": false, 
     "coordinates": null, 
     "timestamp_ms": "1446083724872", 
     "quoted_status": { 
      "contributors": null, 
      "truncated": false, 
      "text": "I understand He is a criminal but Donald has all the right to be in the discussion. https://t.co/qv3oScGA1U", 
      "is_quote_status": true, 
      "in_reply_to_status_id": null,

Here is my code (Python 2.7 + pandas 0.17.0 or newer):

import json 
import pandas as pd 
tweets_data_path = 'tweet.txt' 
tweets_data = [] 
tweets_file = open(tweets_data_path, "r") 
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

tweets = pd.DataFrame() 

tweets['text'] = map(lambda tweet: tweet['text'], tweets_data) 

print tweets['text'] 

print tweets['text'].astype(str) # Try to convert the panda series into strings so I can tokenize the tweets (strings after "text" in the json format) using regular expression

Here is the output:

0  Hey Don : TheCougCoach :) Want to get iPh0ne 6... 
1  I understand He is a criminal but Donald has a... 
Name: text, dtype: object 

UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)

Two questions:

(1) tweets = pd.DataFrame()

tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)

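Here pandas and map/lambda provide an easy way to get the data stored under "text" in the tweets JSON file. However, "map" only allows matched list lengths, so the output is unfinished (each line ends with ...). Is there a better way to code this?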

(2)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)

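It looks like the input "tweet.txt" is Unicode, and that is why we hit this error? If so, should we encode "tweet.txt" while reading it? The actual input file is very large (several GB or even more), so is there a more efficient way to solve this? Thanks.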

Source

2015-10-30 Chubaka

Don't load the JSON file line by line. The json module supports loading a file in one go:

with open(tweets_data_path) as fp: 
    tweets_data = json.load(fp)

Now step through tweets_data the way you normally would step through lists and dictionaries.
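As a rough sketch of what that stepping could look like: the sample data and the helper name collect_texts below are made up for illustration, and the sketch assumes the parsed result is a list of tweet dictionaries. Under that assumption, a list comprehension does the same job as the map/lambda in the question:

```python
import json

# Hypothetical sample: a JSON array holding two tweet objects
raw = '''[
  {"id": 1, "text": "Hey Don", "quoted_status": {"text": "I understand"}},
  {"id": 2, "text": "caf\\u00e9 time"}
]'''

tweets_data = json.loads(raw)


def collect_texts(tweets):
    """Return the "text" value of every tweet dict that has one."""
    return [tweet["text"] for tweet in tweets if "text" in tweet]


# Top-level texts, one per tweet
texts = collect_texts(tweets_data)

# Nested objects are plain dicts too, so they are indexed the same way
nested = tweets_data[0]["quoted_status"]["text"]
```

The "text" in tweet check skips entries that lack the key instead of raising KeyError.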

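The problem is that JSON does not require a newline after each key-value entry. It is nice that the text file happens to have this format, but you should not rely on it.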
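To illustrate why relying on line breaks is fragile, here is a small sketch with a made-up, pretty-printed document. It assumes the whole input is one JSON value spread over several lines; in that case no single line parses on its own, while parsing the document as a whole works:

```python
import json

# Hypothetical pretty-printed document: one object spread over several lines
pretty = '{\n  "text": "Hey Don",\n  "retweeted": false\n}'

# Line-by-line parsing: no individual line is a complete JSON value,
# so every call fails and nothing is collected
parsed_lines = []
for line in pretty.splitlines():
    try:
        parsed_lines.append(json.loads(line))
    except ValueError:
        continue

# Parsing the whole document at once succeeds
whole = json.loads(pretty)
```

Note that the try/except in the loop hides the failures, so the loop finishes with an empty list and no error message.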

As for the Unicode problem, I would suggest using Python 3, which sidesteps a whole bunch of these issues.
The JSON module documentation for Python 2 says the following, though:

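If the contents of fp are encoded with an ASCII based encoding other than UTF-8 (e.g. latin-1), then an appropriate encoding name must be specified. Encodings that are not ASCII based (such as UCS-2) are not allowed, and should be wrapped with codecs.getreader(encoding)(fp), or simply decoded to a unicode object and passed to loads().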
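The two options named in the quoted passage could look roughly like this. The sketch is written for Python 3 and uses in-memory bytes in place of a real file; the sample text and the choice of UTF-16 are assumptions made only for illustration:

```python
import codecs
import io
import json

# Hypothetical payload containing non-ASCII characters
document = '{"text": "caf\u00e9 \u2615"}'
raw_utf16 = document.encode("utf-16")

# Option 1: wrap the binary stream in a decoding reader, then use json.load
reader = codecs.getreader("utf-16")(io.BytesIO(raw_utf16))
from_reader = json.load(reader)

# Option 2: decode the bytes to a unicode string first, then use json.loads
from_string = json.loads(raw_utf16.decode("utf-16"))
```

Both routes hand the parser text that has already been decoded, so the JSON module never has to guess the encoding.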

Source

2015-10-30 07:31:57 Evert
