2013-04-26

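Improving the efficiency of an R script

I am trying to match two very large data sets (nsar & crsp). My code works fine, but it takes a very long time. My procedure works as follows:

  • Try to match by ticker (while checking that the NAV (just a number) and the date are the same).

  • Try to match by exact fund name (while checking NAV and date).

  • Try to match by closest name: first search for the same NAV and date -> take the resulting list and keep only the companies that are the best match under both matching measures -> take the remaining entries and find the closest match (with the match distance restricted).

    Any suggestions on how I could improve the efficiency of the code?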

    #Go through each nsar entry and try to match with crsp 
    trackchanges = sapply(seq_along(nsar$fund),function(x){ 
    
        #Define vars 
        ticker = nsar$ticker[x] 
        r_date = format(nsar$r_date[x], "%m%Y") 
        nav1 = nsar$NAV_share[x] 
        nav2 = nsar$NAV_sshare[x] 
        searchbyname = 0 
    
        if(nav1 == 0) nav1 = -99 
        if(nav2 == 0) nav2 = -99 
    
        ########## If ticker is available --> Merge via ticker and NAV 
        if(is.na(ticker) == F) 
        { 
    
         #Look for same NAV, date and ticker 
         found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
    
    
         #If nothing found 
         if(length(found) == 0) 
         { 
    
          #Mark that you should search by names 
          searchbyname = 1 
    
         } else { #ticker found 
    
            #Record crsp_fundno and that match is found 
          nsar$match[x] = 1 
          nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
          assign("nsar",nsar,envir=.GlobalEnv) 
    
          #Return: 1 --> Merged by ticker 
          return(1) 
         } 
    
        } 
    
        ########### 
    
        ########### No Ticker available or found --> Exact name matching 
        if(is.na(ticker) == T | searchbyname == 1) 
        { 
    
         #Define vars 
         name = tolower(nsar$fund[x]) 
         company = tolower(nsar$company[x]) 
    
         #Exact name, date and same NAV 
         found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
    
    
    
         #If nothing found 
         if(length(found) == 0) 
         { 
    
          #####Continue searching by closest match 
    
           #First search for nav and date to get list of funds 
           allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1))) 
           allfunds_companies = crsp$company[allfunds] 
    
           #Check if anything found 
           if(length(allfunds) == 0) 
           { 
            #Return: 0 --> nothing found 
            return(0) 
           } 
    
           #Get best match by lev and substring measure for company 
           levmatch = levenstheinMatch(company, allfunds_companies) 
           submatch = substringMatch(company, allfunds_companies) 
    
           allfunds = levmatch[levmatch %in% submatch] 
           allfunds_names = crsp$fund_name2[allfunds] 
    
           #Check if now anything found 
           if(length(allfunds) == 0) 
           { 
            #Mark match (5=Company not found) 
            nsar$match[x] = 5 
    
            #Save globally 
            assign("nsar",nsar,envir=.GlobalEnv) 
    
            #Return: 5 --> Company not found 
            return(5) 
           } 
    
    
           #Get best match by all measures 
           levmatch = levenstheinMatch(name, allfunds_names) 
           submatch = substringMatch(name, allfunds_names) 
    
    
           #Only accept if identical 
           allfunds = levmatch[levmatch %in% submatch] 
           allfunds_names = crsp$fund_name2[allfunds] 
    
    
           if(length(allfunds) > 0) 
           { 
            #Mark match (3=closest name matching) 
            nsar$match[x] = 3 
    
            #Add crsp_fundno to nsar data 
            nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]] 
    
            #Save globally 
            assign("nsar",nsar,envir=.GlobalEnv) 
    
            #Return 3=closest name matching 
            return(3) 
    
           } else { 
            #return 0 -> no match 
            return(0) 
           } 
    
          ##### 
    
         } else { #If exact name,date,nav found 
    
          #Mark match (2=exact name matching) 
          nsar$match[x] = 2 
    
          #Add crsp_fundno to nsar data 
          nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
    
          #Return 2=exact name matching 
          return(2) 
         } 
        } 
    
    
    
    
    
    })#End sapply 
    

    Thank you very much for your help! Laurenz

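  • Can you post a simpler, reproducible example? – Nishanth 2013-04-26 12:36:06

    Some general advice: write fewer comments, but split the workflow into functions. That way your central loop might come to ten lines or so. This makes your main idea easy to grasp, with the details contained in the functions. – 2013-04-26 14:00:18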

    Answer

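    The script is too complex for a complete answer, but the basic problem is in the first line

    #Go through each nsar entry... 
    

    where you set the problem up in an iterative way. R works best with vectors.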

    Start by hoisting the vectorizable parts of the calculation out of the sapply. For example, format r_date once for the whole column:

    nsar$r_date_f <- format(nsar$r_date, "%m%Y") 
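
    As a minimal sketch of why this is safe, the following compares the element-by-element call used inside the loop with a single vectorized call. The dates are made-up toy values standing in for nsar$r_date.

```r
# Hypothetical toy dates standing in for nsar$r_date
r_date <- as.Date(c("2010-01-31", "2010-02-28", "2011-12-31"))

# Element by element, as the original sapply body does on every iteration
loop_result <- sapply(seq_along(r_date),
                      function(i) format(r_date[i], "%m%Y"))

# Vectorized: one call that formats the whole column at once
vec_result <- format(r_date, "%m%Y")

# Both approaches give the same character vector
print(identical(loop_result, vec_result))
```

    The vectorized call does the same work in one pass, so the formatted column can be stored once and reused in every comparison.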
    

    This advice also applies to lines buried deeper in your code. For example, rounding crsp$mnav should be done just once, on the entire column:

    crsp$mnav_r <- round(crsp$mnav, 1) 
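
    To illustrate, here is a hedged sketch of a lookup that reuses the precomputed column instead of rounding all of crsp$mnav on every iteration. The toy crsp rows and the helper name find_by_ticker are assumptions made for this example; only the column names come from the question.

```r
# Hypothetical toy stand-in for the crsp data set
crsp <- data.frame(
  nasdaq      = c("AAA", "BBB", "AAA"),
  caldt2      = c("012010", "012010", "022010"),
  mnav        = c(10.04, 20.26, 10.51),
  crsp_fundno = c(101, 102, 103),
  stringsAsFactors = FALSE
)

# Round the whole column once, outside any loop
crsp$mnav_r <- round(crsp$mnav, 1)

# Hypothetical helper: same condition as the question's ticker lookup,
# but only the two scalar NAVs are rounded per call
find_by_ticker <- function(ticker, r_date, nav1, nav2) {
  which(crsp$nasdaq == ticker &
        crsp$caldt2 == r_date &
        (crsp$mnav_r == round(nav1, 1) | crsp$mnav_r == round(nav2, 1)))
}

hit <- find_by_ticker("AAA", "012010", 10.0, -99)
print(crsp$crsp_fundno[hit])
```

    This still loops over nsar, but each iteration now skips three full-column round() calls.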
    

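    Use R idioms where appropriate. Your code recodes a NAV of 0 to "-99"; if that stands for a missing value, use NA instead:

    nav1 <- nsar$NAV_share 
    nav1[nav1 == 0] <- NA   # 0 is the raw marker that the question recodes to -99 
    nsar$nav1 <- nav1 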
    

    Code from other packages that you might use is then more likely to treat the missing values correctly.
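
    A small sketch of how NA behaves in comparisons, assuming (as in the question's code) that a NAV of 0 marks a missing value. The numbers are made up for illustration.

```r
# Hypothetical NAV column where 0 marks "missing"
nav <- c(12.3, 0, 45.6)
nav[nav == 0] <- NA

# A comparison against NA yields NA rather than TRUE or FALSE
cmp <- nav == 12.3

# which() silently drops NA, so a missing NAV can never produce a match
matches <- which(cmp)

# Summary functions can skip missing values explicitly
avg <- mean(nav, na.rm = TRUE)

print(cmp)
print(matches)
```

    With NA in place, no sentinel such as -99 is needed to keep missing NAVs from matching.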

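    Use established R functions for more complicated queries. This is tricky, but if I read your code correctly, your "same NAV, date and ticker" query could be done as a join using merge, assuming the key columns have already been created by vectorized operations earlier in the code, e.g.,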

    # Keep only the nsar rows that have a ticker 
    nsar1 <- nsar[!is.na(nsar$ticker), , drop=FALSE] 
    # Join on ticker, formatted date and rounded NAV 
    df0 <- merge(nsar1, crsp, 
          by.x = c("ticker", "r_date_f", "nav1_r"), 
          by.y = c("nasdaq", "caldt2", "mnav_r")) 
    

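    This does not cover the "|" condition, so additional work is needed. The plyr, data.table and sqldf packages (among others) were developed in part to simplify these kinds of operations, so they may be worth investigating as you become more comfortable with vectorized calculations.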
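
    One possible way to handle the "|" condition with base R only is to run the join once per NAV column and combine the results. This is a sketch under assumptions, not a tested solution for the real data: the toy rows, the id column and the rounded columns nav1_r and nav2_r are hypothetical. Note that merge matches NA keys to each other by default, so missing NAVs on both sides would need filtering first.

```r
# Hypothetical toy data; id is an assumed row identifier for nsar
nsar <- data.frame(
  id       = 1:3,
  ticker   = c("AAA", "BBB", NA),
  r_date_f = c("012010", "012010", "012010"),
  nav1_r   = c(10.0, 99.9, 30.0),
  nav2_r   = c(NA, 20.3, NA),
  stringsAsFactors = FALSE
)
crsp <- data.frame(
  nasdaq      = c("AAA", "BBB", "CCC"),
  caldt2      = c("012010", "012010", "012010"),
  mnav_r      = c(10.0, 20.3, 30.0),
  crsp_fundno = c(101, 102, 103),
  stringsAsFactors = FALSE
)

nsar1 <- nsar[!is.na(nsar$ticker), , drop = FALSE]
by_y  <- c("nasdaq", "caldt2", "mnav_r")

# One join per NAV column: together they cover "nav1 OR nav2"
m1 <- merge(nsar1, crsp, by.x = c("ticker", "r_date_f", "nav1_r"), by.y = by_y)
m2 <- merge(nsar1, crsp, by.x = c("ticker", "r_date_f", "nav2_r"), by.y = by_y)

# Stack the hits and keep the first match per nsar row
hits <- rbind(m1[, c("id", "crsp_fundno")], m2[, c("id", "crsp_fundno")])
hits <- hits[!duplicated(hits$id), ]
hits <- hits[order(hits$id), ]
print(hits)
```

    Keeping the first hit per id mirrors the found[1] choice in the question's code; rows of nsar without a hit would then go on to the name-matching steps.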

    It is hard to say for sure, but I think these three steps address the main challenges in your code.