2017-02-21

Generating n-grams with Julia

To generate word bigrams in Julia, I can simply zip the original list with a copy of it that drops the first element:

julia> s = split("the lazy fox jumps over the brown dog") 
8-element Array{SubString{String},1}: 
"the" 
"lazy" 
"fox" 
"jumps" 
"over" 
"the" 
"brown" 
"dog" 

julia> collect(zip(s, drop(s,1))) 
7-element Array{Tuple{SubString{String},SubString{String}},1}: 
("the","lazy") 
("lazy","fox") 
("fox","jumps") 
("jumps","over") 
("over","the") 
("the","brown") 
("brown","dog") 

To generate trigrams, I can use the same collect(zip(...)) idiom:

julia> collect(zip(s, drop(s,1), drop(s,2))) 
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox") 
("lazy","fox","jumps") 
("fox","jumps","over") 
("jumps","over","the") 
("over","the","brown") 
("the","brown","dog") 

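But I have to add the third list to the zip by hand. Is there an idiomatic way to do this for n-grams of any order?

For example, I would like to avoid writing this to extract 5-grams: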

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4))) 
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox","jumps","over") 
("lazy","fox","jumps","over","the") 
("fox","jumps","over","the","brown") 
("jumps","over","the","brown","dog") 

Answers

Here is a clean one-liner for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...)) 

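It uses a generator comprehension to iterate over k, the number of elements to drop. The splat (...) operator then unpacks the resulting Drop iterators into the arguments of zip, and finally collect gathers the zipped result into an Array.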

julia> ngram(s, 2) 
7-element Array{Tuple{SubString{String},SubString{String}},1}: 
("the","lazy") 
("lazy","fox") 
("fox","jumps") 
("jumps","over") 
("over","the") 
("the","brown") 
("brown","dog") 

julia> ngram(s, 5) 
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}: 
("the","lazy","fox","jumps","over") 
("lazy","fox","jumps","over","the") 
("fox","jumps","over","the","brown") 
("jumps","over","the","brown","dog") 

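As you can see, this is very similar to your solution. It only adds a simple comprehension that iterates over the number of elements to drop, so that the n-gram length can be dynamic.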
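The one-liner above relies on `drop` being callable unqualified, as it was when this was written. On Julia 1.x, `drop` is not exported and lives in `Base.Iterators`, so a minimal sketch of the same idea, assuming a 1.x session, might look like this (the variable name `words` is only for illustration):

```julia
# Sketch for Julia 1.x (assumption): drop is reached as Iterators.drop
# instead of being called unqualified.
ngram(s, n) = collect(zip((Iterators.drop(s, k) for k in 0:n-1)...))

words = split("the lazy fox jumps over the brown dog")

bigrams = ngram(words, 2)    # 7 two-element tuples
fivegrams = ngram(words, 5)  # 4 five-element tuples
```

The logic is unchanged: zip stops at the shortest of its arguments, so a list of 8 words yields 8 - n + 1 n-grams.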

Cool! Thanks @HarrisonGrodin, I didn't know `drop(s,0)` was possible =) – alvas

@alvas No problem! Also, in case `drop(s,0)` were not possible, the following would work. :) `zip(s, (drop(s,k) for k = 1:n-1)...)` –

Another way is to use partition() from Iterators.jl:

ngram(s,n) = collect(partition(s, n, 1)) 
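This version changes the output slightly: it uses SubArrays instead of Tuples. Little is lost, and it can avoid allocation and memory copying. If the underlying word list is static, this is fine and also faster (in my benchmarks). The code: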

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1] 

And the output:

julia> ngram(s,5) 
SubString{String}["the","lazy","fox","jumps","over"] 
SubString{String}["lazy","fox","jumps","over","the"] 
SubString{String}["fox","jumps","over","the","brown"] 
SubString{String}["jumps","over","the","brown","dog"] 

julia> ngram(s,5)[1][3] 
"fox" 

For larger word lists, the memory requirements are also considerably smaller.
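The condition that the underlying word list be static matters because views alias their parent array rather than copying it. A small sketch of that behaviour, using the hypothetical names `ngram_views`, `words` and `grams`:

```julia
# Each n-gram is a view into `s`; no words are copied.
ngram_views(s, n) = [view(s, i:i+n-1) for i in 1:length(s)-n+1]

words = ["the", "lazy", "fox", "jumps"]
grams = ngram_views(words, 2)

# Mutating the parent array changes what every overlapping view shows.
words[2] = "quick"
first_gram = collect(grams[1])   # copy the view out to inspect it
```

After the assignment, the first bigram reads "the", "quick" rather than "the", "lazy". If the word list may change later, copying each view (for example with collect) trades the memory saving for safety.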

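Also note that using a generator lets you process the n-grams one at a time, faster and with less memory, which may be enough for the processing you need (counting something, or passing the n-grams through a hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).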
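As an illustration of one-at-a-time processing that needs no extra package, here is a minimal sketch that counts n-grams from a lazy generator of views. It swaps in a plain generator expression rather than partition, and the names `lazy_ngrams` and `count_ngrams` are hypothetical:

```julia
# Lazy n-gram generator: no n-gram is built until the loop asks for it.
lazy_ngrams(s, n) = (view(s, i:i+n-1) for i in 1:length(s)-n+1)

# Count how often each n-gram occurs, visiting them one at a time.
function count_ngrams(s, n)
    counts = Dict{Vector{String},Int}()
    for g in lazy_ngrams(s, n)
        key = String.(g)                      # owned copy used as the Dict key
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

words = split("the lazy fox jumps over the lazy dog")
counts = count_ngrams(words, 2)
```

Only the distinct n-grams are stored as keys; the full list of n-grams never exists in memory at once.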