2016-12-27 35 views
1

I have the text below and need to extract the words before and after specific terms, in R with the stringi package.

Example of extracting the specific terms:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n > quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of" 
library(stringi) 
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE) 

The actual output is below:

[[1]] 
[1] "engineering plastics" 

[[2]] 
[1] "iso 9001" 

[[3]] 
[1] "office automation" 

Desired output:

[1] globally expanding its engineering plastics centered on polycarbonate resin 
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of 

Basically, I need the words before and after each of the specific terms I mentioned.

+0

Your call to 'stri_extract_all_fixed' references an undefined variable 'prav_1'. Please make your example reproducible. – drammock

+0

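All of the text is either before or after your specific words. You seem to want 3 words before "engineering plastics" and 4 words after; 2 words before "iso 9001" and quite a few words after it. Do you have a reliable rule, one you can explain, for how many words before and after you want to extract? – Gregor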

+0

Please change prav_1 to sometext –

Answer

0

Here is an idea to get you started with extracting the text:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n > quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of" 
library(stringi) 
words <- c('engineering plastics', 'iso 9001','office automation') 
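# Up to 10 "word + space" groups before each term, up to 10 "space + word" groups after it
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}") 
# 'overlap' is an option of the fixed-pattern search only, so it is dropped here
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE) 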

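Explanation: to capture the words around each term, I use a simple regex building block:

"([^ ]+ ){0,10}" 

This means:

  1. anything but a space, repeated as many times as possible
  2. then a space
  3. and all of this up to ten times

The same block with the space moved to the front, "( [^ ]+){0,10}", picks up the words that follow the term, because the term itself is immediately followed by a space.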
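To see the building block in isolation, here is a small sketch on a made-up string. The string and the limit of 2 are chosen only for illustration.

```r
library(stringi)

# Made-up example string, not taken from the question
x <- "one two three four five"

# Up to two "word + space" groups, followed by the literal word "four"
stri_extract_first_regex(x, "([^ ]+ ){0,2}four")
# [1] "two three four"
```

Only two words are returned even though three precede "four", because the quantifier caps the number of repetitions.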

This is very simple and naive (for example, it treats every "&" or ">" as a word), but it works.
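Since the desired output in the question uses a different number of words around each term, one possible refinement is to make the counts parameters. The sketch below is only an illustration under a few assumptions: `extract_context`, `n_before` and `n_after` are hypothetical names, not part of stringi; it swaps `[^ ]` for `\S` and `\s` so that newlines also count as separators; and it wraps each term in `\Q...\E` so that the term is matched literally.

```r
library(stringi)

# Hypothetical helper (not part of stringi): for each term, return the term
# together with up to n_before words before it and up to n_after words after it.
extract_context <- function(text, terms, n_before = 3, n_after = 4) {
  before <- stri_paste("(?:\\S+\\s+){0,", n_before, "}")  # "word + whitespace" groups
  after  <- stri_paste("(?:\\s+\\S+){0,", n_after, "}")   # "whitespace + word" groups
  # \Q...\E quotes the term, so characters such as "." or "(" are taken literally
  pattern <- stri_paste(before, "\\Q", terms, "\\E", after)
  stri_extract_first_regex(text, pattern, case_insensitive = TRUE)
}

# Shortened sample taken from the text in the question
sample_text <- "unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold"

extract_context(sample_text, "engineering plastics", n_before = 3, n_after = 4)
# [1] "globally expanding its engineering plastics centered on polycarbonate resin,"
```

Punctuation attached to a word (the trailing comma above) is still kept as part of that word, so this shares the naivety of the original pattern. A term that does not occur in the text yields `NA`.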