Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I'm working with the Enron dataset, and I'm interested in extracting the clean body of the emails into a list, keeping each reply as a string in the list. E.g.:

For the following email:

Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it?   And by the way, don't start with the excuses.   You're 
expected to be a full, gourmet cook.

Kisses, not music, makes cooking a more enjoyable experience.  




"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:  
Subject: Hi


I told you I have a long email address.

I've decided what to prepare for dinner tomorrow.  I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience.

Watch the debate if you are home tonight.  I want a report tomorrow...
Jen

___________________________________________________________________
To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax,
all in one place - sign up today at http://www.zdnetonebox.com

I want to get the following response:

["So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience.", 
"I told you I have a long email address. I've decided what to prepare for dinner tomorrow.  I hope you aren't 
expecting anything extravagant because my culinary skills haven't been
put to use in a while.  My only request is that your stereo works.  Music
makes cooking a more enjoyable experience. Watch the debate if you are home tonight.  I want a report tomorrow...
Jen"]

Where the first element in the list is:

"So what is it?   And by the way  don't start with the excuses.   You're 
expected to be a full  gourmet cook. Kisses  not music  makes cooking a more enjoyable experience."

Is there a library capable of doing this?

I have tried the Python email library, but it does not seem to have that functionality, since I get the full body as the response:

import email
message = data_
e = email.message_from_string(message)
print (e.get_payload())

So, what is it? And by the way, don't start with the excuses.
You're expected to be a full, gourmet cook. Kisses, not music, makes cooking a more enjoyable experience. "Jennifer White" jenwhite7@zdnetonebox.com on 10/17/2000 04:19:20 PM To: jarnold@enron.com cc: Subject: Hi I told you I have a long email address. I've decided what to prepare for dinner tomorrow. I hope you aren't expecting anything extravagant because my culinary skills haven't been put to use in a while. My only request is that your stereo works. Music makes cooking a more enjoyable experience. Watch the debate if you are home tonight. I want a report tomorrow... Jen ___________________________________________________________________ To get your own FREE ZDNet Onebox - FREE voicemail, email, and fax, all in one place - sign up today at http://www.zdnetonebox.com '



1 Answer

I'm going to assume that you have all the Enron email messages in a .csv file, which is a common format for this dataset. I noted some data-cleansing issues when processing this single message, mostly around the " " in the message. I'm still trying to figure out how to resolve this small issue.

import re as regex

def expunge_doublespaces(raw_string):
    # Repeatedly collapse double spaces until only single spaces remain.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))


def parse_raw_email_message(raw_message):
    lines = raw_message.splitlines()
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text. They are
            # joined without a separator, which is why words at line
            # breaks run together in the output below.
            message += line
            email['body'] = message
        else:
            # Lines with a colon are treated as "key: value" headers.
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

###############################################
# change this open section to fit your dataset
###############################################
with open('enron_emails/sample_email.txt', 'r') as in_file:
    parsed_email = parse_raw_email_message(in_file.read())
    for key, value in parsed_email.items():
        if key == "body":
            # this regex inserts a space after a period, a comma, or 't
            # when the next character is not whitespace
            first_cleaning = regex.sub(r"(?<=('t)(?=[^\s]))|(?<=[.,])(?=[^\s])", r' ', value)
            cleaned_body = expunge_doublespaces(first_cleaning)
            print(cleaned_body)
            # print output
            # So, what is it? And by the way, don't start with the excuses. You're
            # expected to be a full, gourmet cook. Kisses, not music, makes cooking
            # a more enjoyable experience. I told you I have a long email address.
            # I've decided what to prepare for dinner tomorrow. I hope you aren't
            # expecting anything extravagant because my culinary skills haven't
            # beenput to use in a while. My only request is that your stereo works.
            # Musicmakes cooking a more enjoyable experience. Watch the debate if
            # you are home tonight. I want a report tomorrow. . . Jen
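
The joined words in the output above ("beenput", "Musicmakes") appear where two body lines were concatenated with nothing between them. One possible way around this is to keep the line breaks while collecting the body and then collapse all whitespace in a single pass. The following is only a minimal sketch, assuming it is acceptable to turn every run of whitespace (including line breaks) into one space; `normalize_whitespace` is a hypothetical helper name, not part of the code above.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any run of whitespace and
    # discards empty strings, so joining with ' ' leaves single spaces only.
    return ' '.join(text.split())


# Hypothetical sample: two line breaks and one double space.
sample = "my culinary skills haven't been\nput to use in a while.  Music\nmakes cooking"
print(normalize_whitespace(sample))
# -> my culinary skills haven't been put to use in a while. Music makes cooking
```

Because it also removes double spaces, a helper like this could stand in for `expunge_doublespaces` without recursion.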

UPDATE

Here is another way to obtain the body of the email message. There are other examples in another question that I answered.

import re as regex
import email

def expunge_doublespaces(raw_string):
    # Repeatedly collapse double spaces until only single spaces remain.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))

with open('enron_emails/sample_email.txt', 'r') as in_file:
    email_body = ''
    raw_message = in_file.read()

    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)

    # Iterate over all the parts and subparts of the message object tree
    for part in msg.walk():

        # Only process parts whose content type is plain text
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
            # Remove the quoted-reply header: the sender line ending in a
            # timestamp, followed by the To:, cc: and Subject: lines
            first_cleaning = regex.sub(
                r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))",
                r' ', email_body)
            clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
            print(clean_body)
            # print output
            # So, what is it? And by the way, don't start with the excuses.
            # You're expected to be a full, gourmet cook. Kisses, not music,
            # makes cooking a more enjoyable experience. I told you I have a
            # long email address. I've decided what to prepare for dinner
            # tomorrow. I hope you aren't expecting anything extravagant
            # because my culinary skills haven't been put to use in a while.
            # My only request is that your stereo works. Music makes cooking a
            # more enjoyable experience. Watch the debate if you are home
            # tonight. I want a report tomorrow... Jen
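
The question asks for a list with one string per reply, while the code above prints the whole body as a single string. A rough sketch of one way to get the list is to split the payload at the quoted-reply header instead of deleting it. It rests on several assumptions: the message is a single-part text/plain message, every quoted reply starts with a header shaped like the one in the sample (sender line ending in `on MM/DD/YYYY HH:MM:SS AM|PM`, then `To:`, `cc:` and `Subject:` lines), and a line of underscores marks a footer to discard. `REPLY_HEADER`, `FOOTER` and `split_replies` are hypothetical names chosen for illustration.

```python
import email
import re

# Assumed shape of the quoted-reply header, taken from the sample message.
# Only non-capturing groups are used so re.split() returns just the text
# between headers.
REPLY_HEADER = re.compile(
    r'^.*\bon \d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (?:AM|PM)[ \t]*\n'
    r'To:.*\n'
    r'cc:.*\n'
    r'Subject:.*$',
    re.MULTILINE,
)
# Assumed footer marker: a line consisting of ten or more underscores.
FOOTER = re.compile(r'^_{10,}[ \t]*$', re.MULTILINE)


def split_replies(raw_message):
    """Return each reply in a raw message as one whitespace-normalized string."""
    # Assumes a non-multipart message, so get_payload() returns a string.
    body = email.message_from_string(raw_message).get_payload()
    replies = []
    for chunk in REPLY_HEADER.split(body):
        # Keep only the text before the footer separator, if there is one.
        chunk = FOOTER.split(chunk, maxsplit=1)[0]
        text = ' '.join(chunk.split())
        if text:
            replies.append(text)
    return replies


# Shortened, hypothetical version of the sample message from the question.
raw = (
    "From: john.arnold@enron.com\n"
    "To: jenwhite7@zdnetonebox.com\n"
    "Subject: Re: Hi\n"
    "\n"
    "So, what is it?\n"
    "\n"
    "\"Jennifer White\" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM\n"
    "To: jarnold@enron.com\n"
    "cc:  \n"
    "Subject: Hi\n"
    "\n"
    "\n"
    "I told you I have a long email address.\n"
    "Jen\n"
    "\n"
    "___________________________________________________________________\n"
    "To get your own FREE ZDNet Onebox\n"
)
print(split_replies(raw))
# -> ['So, what is it?', 'I told you I have a long email address. Jen']
```

Messages in the dataset that quote replies in a different style (for example "-----Original Message-----" blocks) would need their own patterns, so this is a starting point rather than a general solution.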
