2015-11-25 25 views

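Structuring a BigQuery query on trigrams with multiple inputs

Thanks to the help of the answerer of this question, I can now successfully query a word and get a list of its most popular follow-on words. For example, using the word "great", I can get a list of up to 10 words in the following format: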

SELECT second, SUM(cell.page_count) total 
FROM [publicdata:samples.trigrams] 
WHERE first = "great" 
group by 1 
order by 2 desc 
limit 10 

With the output:

second      total
-------------------
deal        3048832
and         1689911
,           1576341
a           1019511
number      984993
many        875974
importance  805215
part        739409
.           700694
as          628978

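What I am having trouble figuring out is how to run this query for multiple words automatically (rather than issuing a separate query for each word), so that I could get an output such as: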

"great"  total  "new_word_1"   new_total_1 ... "new_word_N"  new_total_N 
----------------------------------------------------------------------------------------- 
deal  3048832 "new_follow_on_word1" 123456  ... "follow_on_N1" 234567 
and  1689911 "new_follow_on_word2" 12345  ... "follow_on_N2" 123456 

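Basically, I would like to pass N words in a single query (for example, new_word_1 is a completely different word such as "baseball", with no relation to "great") and get the total counts associated with each word in separate columns.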

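Also, after reading about BigQuery's pricing, I am having trouble figuring out how to limit the total data queried as much as possible. I can think of using only the most recent data (say, 2010 onward) and returning only 2 alphanumeric outputs per word, but I may be missing more obvious limiters.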

Answers


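You can put multiple first words in the same query, but the top 10 follow-on words have to be computed separately for each word and the results then joined together. Here is an example for "great" and "baseball":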

SELECT word1, total1, word2, total2 FROM
  (SELECT ROW_NUMBER() OVER() rowid1, word1, total1 FROM (
    SELECT second AS word1, SUM(cell.page_count) total1
    FROM [publicdata:samples.trigrams]
    WHERE first = "great"
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10)) a1
JOIN
  (SELECT ROW_NUMBER() OVER() rowid2, word2, total2 FROM (
    SELECT second AS word2, SUM(cell.page_count) total2
    FROM [publicdata:samples.trigrams]
    WHERE first = "baseball"
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10)) a2
ON a1.rowid1 = a2.rowid2
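
For more than two words, writing the joins by hand gets tedious. Below is a rough sketch in Python that generates the same kind of legacy SQL text for any list of first words. The helper names `top_followers_subquery` and `build_query` are made up for illustration and are not part of any BigQuery library; the script only builds the query string and does not run it. It also assumes the words contain no quotes or backslashes, and simply rejects them rather than guessing at escaping rules.

```python
def top_followers_subquery(word, idx, limit=10):
    """Build the ranked top-N subquery for one first word (legacy SQL text).

    Hypothetical helper: idx is used to make column and alias names unique
    (word1/total1/rowid1/a1, word2/total2/rowid2/a2, ...).
    """
    # Assumption: refuse words that would need escaping instead of escaping them.
    if '"' in word or "\\" in word:
        raise ValueError("word must not contain quotes or backslashes")
    return (
        f"(SELECT ROW_NUMBER() OVER() rowid{idx}, word{idx}, total{idx} FROM (\n"
        f"  SELECT second AS word{idx}, SUM(cell.page_count) total{idx}\n"
        f"  FROM [publicdata:samples.trigrams]\n"
        f'  WHERE first = "{word}"\n'
        f"  GROUP BY 1\n"
        f"  ORDER BY 2 DESC\n"
        f"  LIMIT {int(limit)})) a{idx}"
    )


def build_query(words, limit=10):
    """Join one top-N subquery per word on its row number."""
    if not words:
        raise ValueError("need at least one word")
    n = len(words)
    columns = ", ".join(f"word{i}, total{i}" for i in range(1, n + 1))
    subqueries = [top_followers_subquery(w, i, limit)
                  for i, w in enumerate(words, start=1)]
    sql = f"SELECT {columns} FROM\n{subqueries[0]}"
    # Every further word is joined to the first one on the rank position.
    for i in range(2, n + 1):
        sql += f"\nJOIN\n{subqueries[i - 1]}\nON a1.rowid1 = a{i}.rowid{i}"
    return sql


if __name__ == "__main__":
    print(build_query(["great", "baseball"]))
```

Calling `build_query(["great", "baseball"])` produces a query with the same shape as the hand-written one above. Because the subqueries are combined with an inner JOIN, a word that has fewer than `limit` follow-on words would shorten the whole result to that many rows.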

I just deleted my answer because I realized I had missed the top-10 requirement.