2012-03-18 35 views
3

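R: Converting email addresses to unique integers

R beginner with what seems like a very simple problem: I have some email logs that I have read into R in the following format: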

>log1 
    Date  Time   From     To 
1 2000-01-01 00:00:00 [email protected]   [email protected] 
2 2000-01-02 01:00:00 carolyn @mail.com  [email protected] 
3 2000-01-03 02:00:00 [email protected]   [email protected] 
4 2000-01-04 03:00:00 chris @mail.com   [email protected] 
5 2000-01-05 04:00:00 [email protected]   [email protected] 
6 2000-01-06 05:00:00 [email protected]   [email protected] 

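I need to change log1$From and log1$To into globally unique numeric identifiers, so that any given email address read later from other logs receives the same identifier as in earlier logs.

I have tried: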

id <- as.numeric(as.character(log1[,3]))) 
id<-as.numeric(levels(log1[,3]))) 
id <- charToRaw(log1[,4]), base=16) 

Would some kind soul please help me? Thanks!

Apologies, I should probably have included this:

Date <- c("01/01/2000", "02/01/2000", "03/01/2000", "04/01/2000", "05/01/2000", "06/01/2000", "07/01/2000", "08/01/2000",
          "09/01/2000", "10/01/2000", "11/01/2000", "12/01/2000", "13/01/2000", "14/01/2000", "15/01/2000", "16/01/2000",
          "17/01/2000", "18/01/2000", "19/01/2000", "20/01/2000", "01/01/2000", "02/01/2000")
Time <- c("00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00", "10:00:00",
          "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "00:00:00", "00:00:00")
From <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
          "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
          "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
          "[email protected]", "[email protected]", "[email protected]", "[email protected]")
To <- c("[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
        "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]",
        "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]")
log <- data.frame(Date = Date, Time = Time, From = From, To = To)

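My attempt at using MD5 to generate globally unique identifiers follows. Note how the identifier for [email protected] matches correctly within ID_to but not within ID_from, as with the levels/factor method: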

library(digest)  # provides hmac()

ID_to <- data.frame()
ID_from <- data.frame()

for (i in 1:nrow(log)) {
  # HMAC-MD5 of the recipient; rep(..., 2) means only the first two 8-hex-digit chunks are used
  to <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i, 4], algo = 'md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep = ""))
  ID_to <- rbind(ID_to, to)

  # Same for the sender
  from <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i, 3], algo = 'md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep = ""))
  ID_from <- rbind(ID_from, from)
}

# Paste the two chunks together into a single identifier string
ID_to[, 3] <- paste(ID_to[, 1], ID_to[, 2], sep = "")
ID_from[, 3] <- paste(ID_from[, 1], ID_from[, 2], sep = "")

edgelist <- data.frame(ID_from[, 3], log[, 3], ID_to[, 3], log[, 4], log[, 1], log[, 2])
print(edgelist)
    ID_from...3.     log...3.   ID_to...3.   log...4. log...1. log...2. 
    27488842661591306920  [email protected] 18727221862165338513 [email protected] 01/01/2000 00:00:00 
    38124472891255273775 [email protected] 1251903296725454474  [email protected] 02/01/2000 01:00:00 
    29070047663451376630  [email protected] 17074276751156451031  [email protected] 03/01/2000 02:00:00 
    8261398433828474582 [email protected] 1563683670909194033  [email protected] 04/01/2000 03:00:00 
    18727221862165338513 [email protected] 26735368323826533112  [email protected] 05/01/2000 04:00:00 
    5680838251168988404  [email protected] 2923605896229594830  [email protected] 06/01/2000 05:00:00 
    2351312285811012730  [email protected] 17171333544033270402  [email protected] 07/01/2000 06:00:00 
    328278708432069254  [email protected] 33840664403556851587  [email protected] 08/01/2000 07:00:00 
    1127901879852039037 [email protected] 1973548136161209824  [email protected] 09/01/2000 08:00:00 
    7349515121496417787 [email protected] 5680838251168988404  [email protected] 10/01/2000 09:00:00 
    27488842661591306920  [email protected] 328278708432069254  [email protected] 11/01/2000 10:00:00 
    38124472891255273775 [email protected] 1127901879852039037  [email protected] 12/01/2000 11:00:00 
    29070047663451376630  [email protected] 27488842661591306920  [email protected] 13/01/2000 12:00:00 
    8261398433828474582 [email protected] 38124472891255273775  [email protected] 14/01/2000 13:00:00 
    18727221862165338513 [email protected] 29070047663451376630  [email protected] 15/01/2000 14:00:00 
    5680838251168988404  [email protected] 8261398433828474582  [email protected] 16/01/2000 15:00:00 
    2351312285811012730  [email protected] 2351312285811012730  [email protected] 17/01/2000 16:00:00 
    328278708432069254  [email protected] 7349515121496417787  [email protected] 18/01/2000 17:00:00 
    1127901879852039037 [email protected] 41762759923562968495  [email protected] 19/01/2000 18:00:00 
    7349515121496417787 [email protected] 24894056753582090007  [email protected] 20/01/2000 19:00:00 
    27488842661591306920  [email protected] 18727221862165338513 [email protected] 01/01/2000 00:00:00 
    27488842661591306920  [email protected] 18727221862165338513 [email protected] 02/01/2000 00:00:00 

I tried the following and got an error:

log <- union(levels(log[,3]), levels(log[,4])) 
>Error in emails[, 3] : incorrect number of dimensions 
+0

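I don't know much about R, but from what you describe, you are looking for a unique identifier for the combination of From and To email addresses. You could try creating a hash of their concatenation. R seems to have some hash functions, so you could give that a try. – Gangadhar 2012-03-18 15:09:43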

+0

Thanks for the input, guys. Surely there is a simpler solution than implementing a checksum or a hashmap?! – 2012-03-18 15:41:18

+0

As long as you get a unique identifier for each input, you can use any algorithm (md5, sha, crc, ...). – blejzz 2012-03-18 16:07:51

Answers

1

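You need to create a unique ID for each email address in the log. One way is to compute a CRC checksum of each address and use it as the identifier, but the numbers will be long. Alternatively, you could implement a HashMap in R and use the email address as the HashMap key.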
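The HashMap idea can be sketched in base R with an environment, which behaves like a hash table keyed by strings. This is only an illustration under assumptions: `registry`, `get_id`, and the example.com addresses are hypothetical names, and the IDs stay stable only for as long as the same environment is kept around (or saved and reloaded).

```r
# Hypothetical registry: an environment used as a hash map from address to integer ID.
registry <- new.env(hash = TRUE)

get_id <- function(address, reg = registry) {
  address <- as.character(address)
  if (!exists(address, envir = reg, inherits = FALSE)) {
    # Next unused integer: one more than the number of keys stored so far
    assign(address, length(ls(reg, all.names = TRUE)) + 1L, envir = reg)
  }
  get(address, envir = reg, inherits = FALSE)
}

ids <- vapply(c("alice@example.com", "bob@example.com", "alice@example.com"),
              get_id, integer(1), USE.NAMES = FALSE)
print(ids)  # 1 2 1
```

The repeated address gets the ID it was given the first time, which is the property the question asks for within a single session.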

2

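You can use MD5 to generate globally unique identifiers, since its collision probability is very low. Because its output is 128 bits, you need several numbers to represent it: four 32-bit chunks, since R's integer type is 32 bits wide even in 64-bit builds. Still, this should be easy to handle using short numeric vectors.

Here is how you can generate such a vector of four numbers for an email address (or any other string, for that matter):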

library(digest)
email <- '[email protected]'
# HMAC-MD5 yields 32 hex digits; split them into four 8-digit chunks and read each as a number
as.numeric(paste('0x', substr(rep(hmac('secret56f8a7', email, algo='md5'), 4), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=''))

You could use algo='crc32' and get just one integer, but this isn't recommended, because collisions are far more likely with CRC32 than with MD5.
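To apply the same idea to a whole column, a minimal sketch follows. It assumes the digest package is installed; `email_to_id`, the key `"secret"`, and the example.com addresses are hypothetical. The input is converted with `as.character()` first, on the assumption that the columns may have been read in as factors, in which case hashing the factor rather than its text could give inconsistent IDs.

```r
library(digest)

# Hypothetical helper: one row of four numbers per address, derived from HMAC-MD5.
email_to_id <- function(addresses, key = "secret") {
  addresses <- as.character(addresses)        # guard against factor input
  ids <- vapply(addresses, function(a) {
    h <- hmac(key, a, algo = "md5")           # 32 hex characters
    chunks <- substring(h, c(1, 9, 17, 25), c(8, 16, 24, 32))
    as.numeric(paste0("0x", chunks))          # four values, each below 2^32
  }, numeric(4), USE.NAMES = FALSE)
  t(ids)                                      # transpose: one row per address
}

log_a <- c("alice@example.com", "bob@example.com")
log_b <- c("bob@example.com", "carol@example.com")
id_a <- email_to_id(log_a)
id_b <- email_to_id(log_b)
same <- identical(id_a[2, ], id_b[1, ])       # same address in two different logs
print(same)
```

Because the ID depends only on the key and the address text, no shared state between logs is needed, as long as the key stays the same.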

+0

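Thanks for your input. Unfortunately, this solution doesn't seem to work for me. Looking at the output above, I have (very) probably made some mistake. Thanks – 2012-03-19 00:04:08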

1

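I think this will do what you want. It is efficient, and you can do it using only base packages...

Steps:

1. Convert both columns to factors.

2. Give both factors exactly the same set of levels, the union of the two, so that each email address has a unique ID among the factor levels.

3. Change the entries in each column to the number corresponding to their factor level. We can then identify when "[email protected]" sent and received emails simply by looking for "1" in both columns.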

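log1$From <- as.factor(log1$From)
log1$To <- as.factor(log1$To)
emails <- union(levels(log1$From), levels(log1$To))
# factor(..., levels = ) re-maps the values; assigning levels() directly would only relabel them
log1$From <- factor(log1$From, levels = emails)
log1$To <- factor(log1$To, levels = emails)
log1$From <- as.numeric(log1$From)
log1$To <- as.numeric(log1$To)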

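It is probably a good idea to keep a record of the original email addresses, as I have done here. Then, if you are interested in, say, which emails were sent by [email protected]: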

log1[log1$From == which(emails == "[email protected]"), ] 

That should do it! You could write a function to make it look cleaner...
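One possible shape for such a function is sketched below. It is not the answerer's code: `encode_addresses` and the example.com data are hypothetical, and it uses `match()` against a shared address list in place of the factor conversion, so IDs follow order of first appearance rather than alphabetical level order.

```r
# Hypothetical helper wrapping the steps above; the column names are parameters.
encode_addresses <- function(df, cols = c("From", "To")) {
  # All distinct addresses across the chosen columns, in order of first appearance
  emails <- unique(unlist(lapply(df[cols], as.character), use.names = FALSE))
  for (col in cols) {
    df[[col]] <- match(as.character(df[[col]]), emails)  # integer code per address
  }
  list(log = df, emails = emails)
}

demo <- data.frame(From = c("alice@example.com", "bob@example.com"),
                   To   = c("bob@example.com", "carol@example.com"),
                   stringsAsFactors = FALSE)
res <- encode_addresses(demo)
print(res$log)     # From: 1 2, To: 2 3
print(res$emails)  # the lookup table to keep alongside the encoded log
```

Returning the address list together with the encoded log keeps the record of original addresses that the answer recommends.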

+0

Thanks for your input. Unfortunately, I ran into an error with this solution. – 2012-03-19 00:23:01

+0

Unfortunately, this does not ensure that the same email address gets the same numeric ID when a different log is read, which is what the OP needs. – 2012-03-19 09:39:23
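One way to address the concern in the comment above is to keep the list of known addresses on disk and only ever append to it, so that an address's position, and therefore its ID, never changes between logs. The sketch below assumes this is acceptable for the use case; `assign_ids`, `registry_file`, and the example.com addresses are hypothetical.

```r
# Hypothetical helper: assign IDs from a lookup vector that is saved between runs.
assign_ids <- function(addresses, registry_file) {
  known <- if (file.exists(registry_file)) readRDS(registry_file) else character(0)
  addresses <- as.character(addresses)
  # Append unseen addresses at the end so existing positions (IDs) never change
  known <- c(known, setdiff(unique(addresses), known))
  saveRDS(known, registry_file)
  match(addresses, known)
}

registry_file <- tempfile(fileext = ".rds")  # a fixed path would be used in practice
first  <- assign_ids(c("alice@example.com", "bob@example.com"), registry_file)
second <- assign_ids(c("carol@example.com", "alice@example.com"), registry_file)
print(first)   # 1 2
print(second)  # 3 1
```

Unlike the hashing approach, this yields small consecutive integers, at the cost of having to carry the registry file along with the logs.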
