2017-07-30
3

Can I sensibly split these number strings?

I have a vector of strings like this:

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

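Each one is a fraction followed by the percentage that the fraction represents, and somewhere along the way the two got mashed together. So the first element in this example means 4 out of 7, which is 57.1%. I can easily get the first number, the one before the / (e.g. stringr::word(x, 1, sep = "/")), but the second number can be one or two characters long, so I am struggling to think of a way to extract it. I don't need the % value, because it is easy to recalculate once I have the two numbers.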
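For reference, here is a minimal base-R sketch of the two easy parts described above: pulling out the number before the slash, and recomputing a percentage once both numbers are known. The variable name `numerator` is illustrative only and does not come from the original post.

```r
x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%")

# Everything before the first "/" is the numerator
numerator <- as.numeric(sub("/.*", "", x))
numerator

# Once both numbers are known, the percentage is easy to recompute,
# e.g. 4 out of 7, rounded to one decimal place
pct <- round(4 / 7 * 100, 1)
pct
```

The hard part, which the answers address, is deciding where the denominator ends and the percentage begins.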

Can anyone see a way to do this?

Answers

1

A somewhat ugly solution, but it seems to do what you want:

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

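split_perc <- function(x, signif_digits = 1) {
    x <- gsub("%", "", x)
    # A "-" marks an undefined percentage (e.g. 0/0), so there is nothing to match
    if (grepl("-", x)) return(list(NA, NA))
    # Candidate end positions for the denominator: from the first character
    # after "/" up to two characters before the decimal point
    index1 <- gregexpr("/", x)[[1]][1] + 1
    index2 <- gregexpr("\\.", x)[[1]][1] - 2
    # No decimal point (gregexpr returned -1): go up to the second-to-last character
    if (index2 == -3) index2 <- nchar(x) - 1

    found <- FALSE
    indices <- seq(index1, index2)
    k <- 1
    while (!found && k <= length(indices)) {
        # Try ending the fraction at position indices[k]
        str1 <- substr(x, 1, indices[k])
        parts <- strsplit(str1, "/")[[1]]
        num1 <- as.numeric(parts[1])
        num2 <- as.numeric(parts[2])
        # Percentage implied by the fraction vs. the percentage left in the string
        value1 <- round(num1 / num2 * 100, signif_digits)
        value2 <- round(as.numeric(substr(x, indices[k] + 1, nchar(x))), signif_digits)
        if (value1 == value2) {
            found <- TRUE
        } else {
            k <- k + 1
        }
    }
    if (found) {
        return(list(num1, num2))
    } else {
        return(list(NA, NA))
    }
}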

do.call(rbind,lapply(x,split_perc)) 

Output:

 [,1] [,2] 
[1,] 4 7 
[2,] 0 1 
[3,] 6 10 
[4,] NA NA 
[5,] 11 20 

A few more examples:

y = c("11/2055.003%","11/2055.2%","40/7057.1%") 
do.call(rbind,lapply(y,split_perc)) 

    [,1] [,2] 
[1,] 11 20 # signif_digits defaults to 1 decimal place, so 55.003 rounds to 55.0 and matches.
[2,] NA NA # no match found since 55 != 55.2
[3,] 40 70 

Thank you very much! – Mart


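Oddly, I only found a bug a month later: "11/11100%" seems to be a problem. It should give 11 and 11, but the function returns 11 and 1. My bad for not including a 100% example at the start. But so far all the other cases have worked flawlessly: 10, 10, 10, 11, 11, 12, 12 and 12. – Mart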
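The "11/11100%" case is genuinely ambiguous: "11/1" followed by "1100%" is just as self-consistent as "11/11" followed by "100%". One possible way around it, sketched below, is to assume that the numerator never exceeds the denominator (so the percentage is at most 100) and to skip any split that violates this. That assumption is not stated in the original question, and `split_perc_capped` is a hypothetical helper name, not part of the answer above.

```r
# Hypothetical helper: try every split of the text after "/" into
# denominator + percentage, but only accept splits where
# numerator <= denominator (ASSUMPTION: percentages never exceed 100).
split_perc_capped <- function(s, digits = 1) {
    s <- sub("%$", "", s)
    slash <- regexpr("/", s, fixed = TRUE)
    num <- as.numeric(substr(s, 1, slash - 1))
    rest <- substr(s, slash + 1, nchar(s))
    # Leave at least one character for the percentage part
    for (k in seq_len(max(nchar(rest) - 1, 0))) {
        den <- suppressWarnings(as.numeric(substr(rest, 1, k)))
        pct <- suppressWarnings(as.numeric(substr(rest, k + 1, nchar(rest))))
        # Skip unparseable pieces, zero denominators, and splits above 100%
        if (is.na(den) || is.na(pct) || den == 0 || num > den) next
        if (isTRUE(all.equal(round(num / den * 100, digits), round(pct, digits)))) {
            return(c(numerator = num, denominator = den))
        }
    }
    c(numerator = NA_real_, denominator = NA_real_)
}

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%", "11/11100%")
res <- t(sapply(x, split_perc_capped))
res
```

Under that assumption the "11/1" candidate is rejected because 11 > 1, so the loop moves on to "11/11". If the data can contain percentages above 100, this restriction would need to be dropped or replaced by some other tie-breaking rule.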

0

As you point out, once you have the fraction, the percentage can be recalculated. Can you use that fact to figure out where the split should be?

library(stringr)  # provides word()

GuessSplit <- function(string) {

    tolerance <- 0.001  # How close should the fraction be?
    numerator <- as.numeric(word(string, 1, sep = "/"))
    second.half <- word(string, 2, sep = "/")
    second.half <- strsplit(second.half, '')[[1]]

    # assuming they all end in percent signs
    possibilities <- length(second.half) - 1

    # Leave at least one character for the percentage part
    for (position in seq_len(max(possibilities - 1, 0))) {

        denom.guess <- as.numeric(paste0(second.half[1:position], collapse = ''))
        percent.guess <- as.numeric(paste0(second.half[(position + 1):possibilities], collapse = '')) / 100

        value <- numerator / denom.guess

        # isTRUE() treats an NA comparison (e.g. from "0/0-%") as "no match"
        if (isTRUE(abs(value - percent.guess) < tolerance)) {
            return(list(numerator = numerator, denominator = denom.guess))
        }
    }
}

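This needs a little love to handle weird cases, and it could probably be more elegant about the possibility that it cannot find an answer. I am also not sure what return type would be best. Maybe you only need the denominator, since the numerator is easy to get, but I figured a list of both would be the most general. I hope this is a reasonable start?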
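One possible way to handle the "cannot find an answer" case and the return-type question is sketched below: a small wrapper that runs a splitter over a whole vector and always yields one data frame row per input, with NA where nothing was found. The names `split_all` and `toy_splitter` are hypothetical. `toy_splitter` is only a stand-in so that the example runs on its own; in practice a function like GuessSplit would be passed instead.

```r
# Hypothetical wrapper: apply `splitter` to each string and collect the
# results in a data frame. A NULL result or an error becomes a row of NAs.
split_all <- function(strings, splitter) {
    rows <- lapply(strings, function(s) {
        res <- tryCatch(splitter(s), error = function(e) NULL)
        if (is.null(res)) {
            res <- list(numerator = NA_real_, denominator = NA_real_)
        }
        data.frame(input = s,
                   numerator = res$numerator,
                   denominator = res$denominator)
    })
    do.call(rbind, rows)
}

# Stand-in splitter used only to keep this example self-contained:
# it knows a single answer and returns NULL for everything else.
toy_splitter <- function(s) {
    if (s == "4/757.1%") list(numerator = 4, denominator = 7) else NULL
}

result <- split_all(c("4/757.1%", "0/0-%"), toy_splitter)
result
```

A data frame keeps the inputs and outputs aligned, which makes unmatched rows easy to spot with `is.na(result$denominator)`.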

1

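A solution using tidyverse and stringr. We can define a function that splits the second part of the string at every candidate position, then compute the percentage for each candidate to see which one makes sense. dt2 is the data frame showing the best split position, and the number you need is in column V3.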

library(tidyverse) 
library(stringr) 

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%") 

dt <- str_split_fixed(x, pattern = "/", n = 2) %>% 
    as.data.frame(stringsAsFactors = FALSE) %>% 
    mutate(ID = 1:n()) %>% 
    select(ID, V1, V2) 

# Design a function to split the second column based on position 
split_df <- function(position, dt){ 
    dt_temp <- dt %>% 
    mutate(V3 = str_sub(V2, 1, position)) %>% 
    mutate(V4 = str_sub(V2, position + 1)) %>% 
    mutate(Pos = position) 

    return(dt_temp) 
} 

# Process the data 
dt2 <- map_df(1:3, split_df, dt = dt) %>% 
    # Remove % in V4 
    mutate(V4 = str_replace(V4, "%", "")) %>% 
    # Convert V1, V3 and V4 to numeric 
    mutate_at(vars(V1, V3, V4), as.numeric) %>% 
    # Calculate possible percentage 
    mutate(V5 = V1/V3 * 100) %>% 
    # Calculate the difference between V4 and V5 
    mutate(V6 = abs(V4 - V5)) %>% 
    # Select the smallest difference based on V6 for each group 
    group_by(ID) %>% 
    arrange(ID, V6) %>% 
    slice(1) 

# The best split is now in V3 
dt2$V3 
[1] 7 1 10 0 20