2017-09-26 63 views

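Error when adding a column to a data frame in R

I am processing OCR'd PDF files, extracting text from them, and building a data frame from it. What I get are vectors, and I cannot concatenate them into a single row so that they can be added as a column to the data frame. With this chunk of code I extract the column for the data frame: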

library(stringi)

chk_words=c("Swimming pool","Gym","west","para")
tp_big=c("swimming pool in a farm","gym","west","north","south")
x=list()
for(i in chk_words){
    # "Present" if any element of tp_big contains the word (case-insensitive)
    br=if(any(stri_detect_fixed(tolower(tp_big),tolower(i)))) "Present" else "Not Present"
    print(br)

    # store only the words that were actually found
    if(br == "Present") {
        x[[i]]=i
    }
}
# collapse the matched words into a single string, once, after the loop
tc=unlist(unique(x))
x=paste(tc,collapse=" ")

df11=data.frame(x)

The output (data frame) I get is:

x 
Swimming pool Gym west 

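But when I try to use the code above inside this larger piece of code, I cannot get the required column "x". This is the whole piece of code: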

library(pdftools)
library(tesseract)
library(stringi)
library(TraMineRextras)

All_files=Sys.glob("*.pdf")
v1 <- numeric(length(All_files))
chk_words=c("Swimming pool","Gym","west","para")
word <- "Gym"
tc=c()
ps=c()
x=list()
df <- data.frame()
df11 <- data.frame()
Status="Present"

for (i in seq_along(All_files)){

    file_name <- All_files[i]

    cnt <- pdf_info(All_files[i])$pages
    print(cnt)

    for(j in seq_len(cnt)){
        img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
        text <- ocr(img_file)
        ocr_text <- capture.output(cat(text))
        check <- sapply(ocr_text, paste, collapse="")
        junk <- dir(path="D:/Deepesh/R Script/All_PDF_Files/Registration_Certificates_OCR", pattern="tiff")
        file.remove(junk)
        br <- if(length(which(stri_detect_fixed(tolower(check),tolower(word)))) <= 0) "Not Present" else "Present"
        print(br)
        if(br=="Present") {
            v1[i] <- j
            break
        }

        for(k in chk_words){
            sr=if(length(which(stri_detect_fixed(tolower(check),tolower(k)))) <= 0){ print("Not Present") } else {print("Present")}
            if(sr == "Present")
                ps=k
            x[[k]]=ps
            tc=unlist(unique(x))
        }
    }
    y=paste(tc,collapse=" ")
    #tc=paste(tc,collapse=" ")
    Status <- if(v1[i] == 0) "Not Present" else "Present"
    pages <- if(v1[i] == 0) "-" else
        paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
    words <- if(v1[i] == 0) "-" else word
    df <- rbind(df, cbind(file_name = basename(file_name),
                          Status, pages = pages, words = words,y))
}

Now I get output like this (y is assigned as NULL):

file_name  Status       pages    words  y
test1.pdf  Present      test1_1  Gym
test2.pdf  Not Present  -

What I expect is:

file_name  status       pages    words  y
test1.pdf  Present      test1_1  gym    swimming pool, gym
test2.pdf  Not Present  -

Any suggestions on where I am going wrong? Thanks in advance.

P.S. The sample PDF files can be accessed here; see this post for more clarity.

Answers

checkList = list()
j = 0
for(i in chk_words){
    # note: %in% compares whole elements of ocr_text with i, not substrings
    chk = any(ocr_text %in% i)
    if(chk) {
        j = j + 1
        checkList[[j]] <- i
    }
}
# assign the result of paste() directly; cat() only prints and returns NULL
THIRD_COL <- paste(shQuote(unlist(checkList), type="cmd"), collapse=", ")

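The above will give you "swimming pool", "gym". What I did: if the condition is satisfied, the word from chk_words is stored in checkList (which is a list). Then I used shQuote inside paste to return the desired output.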
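
As a minimal sketch of the approach described above, assuming ocr_text holds one OCR'd line per element (the sample lines below are made up for illustration, and the name `found` is hypothetical), the matched words can be picked out with a case-insensitive substring test and then quoted and joined. Since paste() already returns the string, it can be assigned directly.

```r
# Hypothetical OCR output: one element per line of recognised text
ocr_text <- c("Swimming pool in a farm", "GYM on the first floor", "north wing")
chk_words <- c("Swimming pool", "Gym", "west", "para")

# Keep each word that occurs as a substring of any line (case-insensitive)
found <- Filter(
  function(w) any(grepl(tolower(w), tolower(ocr_text), fixed = TRUE)),
  chk_words
)

# Quote each match and join them into a single string
THIRD_COL <- paste(shQuote(found, type = "cmd"), collapse = ", ")
print(THIRD_COL)
```

The substring test is a design choice for OCR lines that contain more than the keyword itself; an exact-match test would only succeed when a whole line equals the keyword.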

Updated w.r.t. the answer, but I am still unable to get the "THIRD_COL" column correctly. – deepesh

You need to debug your code, since you have put it inside two nested for loops and if statements. I don't see any reason why my code chunk would not work. – Santosh

I could not find anything when executing the code. – deepesh
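
One possible reason the keyword column stays empty, offered as a guess from reading the posted loop rather than a confirmed diagnosis: break runs as soon as word is found on a page, so the inner loop over chk_words never sees that page. The sketch below checks the keywords before breaking. The `pages` list and the `has_word` helper are hypothetical stand-ins for the OCR step, not part of the original code.

```r
# Hypothetical OCR text for a two-page file (stands in for ocr(img_file))
pages <- list(
  c("Registration certificate", "north block"),
  c("Swimming pool and gym", "facilities")
)
word <- "Gym"
chk_words <- c("Swimming pool", "Gym", "west", "para")

# TRUE if any line of `lines` contains `w`, ignoring case
has_word <- function(lines, w) {
  any(grepl(tolower(w), tolower(lines), fixed = TRUE))
}

found <- character(0)
hit_page <- 0
for (j in seq_along(pages)) {
  check <- pages[[j]]
  # Collect keywords for this page before deciding whether to stop
  for (k in chk_words) {
    if (has_word(check, k)) found <- union(found, k)
  }
  # Stop at the first page that contains the main word
  if (has_word(check, word)) {
    hit_page <- j
    break
  }
}

# Collapse the collected keywords into one string for the data frame column
y <- paste(found, collapse = ", ")
print(hit_page)
print(y)
```

In a multi-file loop, `found` would also need to be reset at the start of each file so that keywords from one PDF do not carry over to the next.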
