2016-06-30

Convert Outlook PST to JSON using libpst

I have an Outlook PST file and would like to get the emails as JSON, e.g.:

{"emails": [ 
{"from": "[email protected]", 
"to": "[email protected]", 
"bcc": "[email protected]", 
"subject": "mitm", 
"content": "be careful!" 
}, ...]} 

I have thought of using readpst to convert the file to MH format and then scanning the result with a Ruby/Python/bash script. Is there a better way?

Unfortunately, the ruby-msg gem does not work on my PST files (and it looks like it has not been updated since 2014).

Answer

I figured out a way to do it in two stages: first convert the PST to mbox, then convert the mbox to JSON:

# requires installing libpst 
pst2json my.pst 
# or you can specify a custom output dir and an outlook mail folder, 
# e.g. Inbox, Sent, etc. 
pst2json -o email/ -f Inbox my.pst 

Here pst2json is my own script, and mbox2json is slightly modified from Mining the Social Web.

pst2json

#!/usr/bin/env bash 

usage(){ 
    echo "usage: $(basename "$0") [-o <output-dir>] [-f <folder>] <pst-file>" 
    echo "default output-dir: email/mbox-all/<pst-file>" 
    echo "default folder: Inbox" 
    exit 1 
} 

# readpst ships with libpst; check for it without printing its path
command -v readpst >/dev/null || { echo "Error: libpst not installed" >&2; exit 1; } 
folder=Inbox 

while (($# > 0)); do
    # Only one positional argument (the PST file) is allowed
    [[ -n "$pst_file" ]] && usage
    case "$1" in
        -o)
            if [[ -n "$2" ]]; then
                out_dir="$2"
                shift 2
            else
                usage
            fi
            ;;
        -f)
            if [[ -n "$2" ]]; then
                folder="$2"
                shift 2
            else
                usage
            fi
            ;;
        *)
            pst_file="$1"
            shift
            ;;
    esac
done

# A PST file is required
[[ -n "$pst_file" ]] || usage
default_out_dir="email/mbox-all/$(basename "$pst_file")" 
out_dir=${out_dir:-"$default_out_dir"} 
mkdir -p "$out_dir" 
readpst -o "$out_dir" "$pst_file" 
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; } 
res="$out_dir"/"$folder".json 
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res" 
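
For anyone who would rather keep the whole pipeline in one language, the driver logic above can be sketched in Python 3 roughly as follows. This is only a sketch under assumptions: the names `parse_args`, `plan` and `run` are made up for illustration, `mbox2json` is assumed to be an executable on the PATH just as in the bash script, and the only readpst option used is the `-o` flag already shown above.

```python
import argparse
import os
import subprocess


def parse_args(argv):
    """Parse the same options as the bash script: -o <output-dir>, -f <folder>, <pst-file>."""
    parser = argparse.ArgumentParser(prog="pst2json")
    parser.add_argument("-o", dest="out_dir", default=None)
    parser.add_argument("-f", dest="folder", default="Inbox")
    parser.add_argument("pst_file")
    args = parser.parse_args(argv)
    if args.out_dir is None:
        # Same default as the bash script: email/mbox-all/<pst-file>
        args.out_dir = os.path.join("email", "mbox-all", os.path.basename(args.pst_file))
    return args


def plan(args):
    """Return the readpst command plus the mbox and JSON paths the script would use."""
    mbox_path = os.path.join(args.out_dir, args.folder)
    return {
        "readpst_cmd": ["readpst", "-o", args.out_dir, args.pst_file],
        "mbox_path": mbox_path,
        "json_path": mbox_path + ".json",
    }


def run(argv):
    """Run both stages: PST to mbox via readpst, then mbox to JSON via mbox2json."""
    args = parse_args(argv)
    steps = plan(args)
    os.makedirs(args.out_dir, exist_ok=True)
    subprocess.run(steps["readpst_cmd"], check=True)
    if not os.path.isfile(steps["mbox_path"]):
        raise SystemExit("Error: folder %s is missing or empty." % args.folder)
    # Assumption: mbox2json is installed as an executable, as in the bash version
    subprocess.run(["mbox2json", steps["mbox_path"], steps["json_path"]], check=True)
    print("Success: result saved to %s" % steps["json_path"])
```

Calling `run(["-o", "email/", "-f", "Inbox", "my.pst"])` would mirror the second command line shown earlier. Keeping `plan` separate from `run` makes the path logic checkable without readpst installed.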

mbox2json (Python 2.7):

# -*- coding: utf-8 -*- 

import sys 
import mailbox 
import email 
import quopri 
import json 
from BeautifulSoup import BeautifulSoup 

MBOX = sys.argv[1] 
OUT_FILE = sys.argv[2] 
SKIP_HTML=True 

def cleanContent(msg): 

    # Decode message from "quoted printable" format 

    msg = quopri.decodestring(msg) 

    # Strip out HTML tags, if any are present 

    soup = BeautifulSoup(msg) 
    return ''.join(soup.findAll(text=True)) 


def jsonifyMessage(msg): 
    json_msg = {'parts': []} 
    for (k, v) in msg.items(): 
        json_msg[k] = v.decode('utf-8', 'ignore') 

    # The To, Cc, and Bcc fields, if present, could have multiple items 
    # Note that not all of these fields are necessarily defined 

    for k in ['To', 'Cc', 'Bcc']: 
        if not json_msg.get(k): 
            continue 
        # Values are already unicode here, so just drop whitespace and split on commas 
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '') \
            .replace('\r', '').replace(' ', '').split(',') 

    try: 
        for part in msg.walk(): 
            json_part = {} 
            if part.get_content_maintype() == 'multipart': 
                continue 
            content_type = part.get_content_type() 
            if SKIP_HTML and content_type == 'text/html': 
                continue 
            json_part['contentType'] = content_type 
            content = part.get_payload(decode=False).decode('utf-8', 'ignore') 
            json_part['content'] = cleanContent(content) 

            json_msg['parts'].append(json_part) 
    except Exception as e: 
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e),)) 
    finally: 
        # Return whatever was collected, even if a part failed to parse 
        return json_msg 

# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators 
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects 
class Encoder(json.JSONEncoder): 
    def default(self, o): 
        return {'emails': list(o)} 


# The generator itself... 
def gen_json_msgs(mb): 
    while 1: 
        msg = mb.next() 
        if msg is None: 
            break 
        yield jsonifyMessage(msg) 

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file) 
json.dump(gen_json_msgs(mbox),open(OUT_FILE, 'wb'), indent=4, cls=Encoder) 
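
The script above is tied to Python 2.7: `mailbox.UnixMailbox` was removed in Python 3, and `from BeautifulSoup import BeautifulSoup` is the old BeautifulSoup 3 import. A rough Python 3 equivalent using only the standard library might look like the sketch below. It is not the original author's code: the names `strip_tags`, `jsonify_message` and `mbox_to_json` are illustrative, tags are stripped with `html.parser` instead of BeautifulSoup, payloads are decoded with `get_payload(decode=True)` rather than `quopri`, and address lists are split naively on commas (which, like the original, will mis-split display names that contain commas).

```python
import json
import mailbox
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(text):
    """Return text with HTML tags removed; plain text passes through unchanged."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


def jsonify_message(msg, skip_html=True):
    """Turn an email message into a JSON-serializable dict of headers plus parts."""
    json_msg = {"parts": []}
    for key, value in msg.items():
        json_msg[key] = str(value)  # values may be str or email.header.Header

    # To, Cc and Bcc may hold several addresses; not all of them need be present
    for key in ("To", "Cc", "Bcc"):
        if json_msg.get(key):
            json_msg[key] = [a.strip() for a in json_msg[key].split(",") if a.strip()]

    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        content_type = part.get_content_type()
        if skip_html and content_type == "text/html":
            continue
        # decode=True undoes quoted-printable/base64 transfer encodings
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "ignore")
        except LookupError:  # unknown charset name in the message
            text = payload.decode("utf-8", "ignore")
        json_msg["parts"].append(
            {"contentType": content_type, "content": strip_tags(text)}
        )
    return json_msg


def mbox_to_json(mbox_path, out_path):
    """Read every message from an mbox file and write {"emails": [...]} as JSON."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        emails = [jsonify_message(m) for m in box]
    finally:
        box.close()
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"emails": emails}, fh, indent=4, ensure_ascii=False)
```

Unlike the generator-based original, this sketch builds the whole list in memory before writing, which keeps the code short but may be a poor fit for very large mailboxes.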

Now the resulting file can be processed easily, e.g. to get the contents of the emails:

jq '.emails[] | .parts[] | .content' < out/Inbox.json
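
The same traversal can be done in Python, and the output can also be reshaped into the flat records asked for in the question. The sketch below assumes the JSON layout produced by mbox2json (an `emails` list whose items carry header keys plus a `parts` list) and assumes headers use their usual capitalisation (`From`, `To`, `Bcc`, `Subject`); the function names `contents` and `flatten` are made up for illustration.

```python
import json


def contents(doc):
    """Yield every part's content, like jq '.emails[] | .parts[] | .content'."""
    for message in doc["emails"]:
        for part in message["parts"]:
            yield part["content"]


def _as_text(value):
    """Join address lists into one string; pass plain strings through."""
    if isinstance(value, list):
        return ", ".join(value)
    return value


def flatten(doc):
    """Reshape mbox2json output into flat from/to/bcc/subject/content records."""
    flat = []
    for message in doc["emails"]:
        flat.append({
            "from": _as_text(message.get("From", "")),
            "to": _as_text(message.get("To", "")),
            "bcc": _as_text(message.get("Bcc", "")),
            "subject": message.get("Subject", ""),
            # Concatenate all kept parts into a single body string
            "content": "\n".join(p["content"] for p in message["parts"]),
        })
    return {"emails": flat}
```

To use it on the file from the jq example, load it with `json.load(open("out/Inbox.json"))` and pass the result to `contents` or `flatten`.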