2012-02-04 35 views
3

Count how many times a word is used per day

I have a MySQL table named "content" containing (among others) the fields "_date" and "text", for example:

_date  text 
--------------------------------------------------------- 
2011-02-18 I'm afraid my car won't start tomorrow 
2011-02-18 I hope I'm going to pass my exams 
2011-02-18 Exams coming up - I'm not afraid :P 
2011-02-19 Not a single f was given this day 
2011-02-20 I still hope I passed, but I'm afraid I didn't 
2011-02-20 On my way to school :) 

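I'm looking for a query that counts how many times the words "hope" and "afraid" are used per day. In other words, the output must look like this: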

_date  word count 
----------------------- 
2011-02-18 hope 1 
2011-02-18 afraid 2 
2011-02-19 hope 0 
2011-02-19 afraid 0 
2011-02-20 hope 1 
2011-02-20 afraid 1 

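Is there a simple way to do this, or should I just write a separate query for each term? This is what I have now, but I don't know what to put in place of the "?":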

SELECT COUNT(?) FROM content WHERE text LIKE '%hope' GROUP BY _date 

Can someone help me get the query right?

Answers

3

I think the easiest and most readable approach is to use subqueries:

Select 
    _date, 'hope' as word, 
    sum(case when `text` like '%hope%' then 1 else 0 end) as n 
from content 
group by _date 
UNION 
Select 
    _date, 'afraid' as word, 
    sum(case when `text` like '%afraid%' then 1 else 0 end) as n 
from content 
group by _date 

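This approach does not give the best performance. If performance matters, you should group by day in a subquery; the LIKE condition with a leading wildcard is also a performance killer. If you only run the query occasionally in batch mode, this is a workable solution. Describe your performance requirements to get a more precise answer.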

Edited to match the OP's latest requirement.

+0

Nice answer :) By using a join similar to the one in my answer, you can save yourself the UNIONs and only need to write each word once. – knittl 2012-02-04 17:37:19

2

Your query is almost correct:

SELECT _date, 'hope' AS word, COUNT(*) as count 
FROM content WHERE text LIKE '%hope%' GROUP BY _date 

Use %hope% to match the word anywhere in the text (not only at the end of the string). COUNT(*) should do what you want.

To get several words from a single query, use UNION ALL.


Another approach is to build the list of words on the fly and join it as a second table:

SELECT _date, words.word, COUNT(*) as count 
FROM (
    SELECT 'hope' AS word 
    UNION 
    SELECT 'afraid' AS word 
) AS words 
CROSS JOIN content 
WHERE text LIKE CONCAT('%', words.word, '%') 
GROUP BY _date, words.word 

Note that this counts at most one occurrence of each word per row. So »I hope there is still hope« will only give you 1, not 2.
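If every occurrence inside a row should count, one common trick is to compare the length of the text before and after removing the word. The sketch below is only an illustration: it uses Python's built-in sqlite3 module so that it runs without a MySQL server, the table schema is assumed from the question, and the last sample row is a hypothetical addition. In MySQL the pattern would be built with CONCAT('%', w.word, '%') instead of ||, and the integer division would be written with DIV.

```python
import sqlite3

# Rows copied from the question, plus one hypothetical row that repeats a word.
ROWS = [
    ("2011-02-18", "I'm afraid my car won't start tomorrow"),
    ("2011-02-18", "I hope I'm going to pass my exams"),
    ("2011-02-18", "Exams coming up - I'm not afraid :P"),
    ("2011-02-19", "Not a single f was given this day"),
    ("2011-02-20", "I still hope I passed, but I'm afraid I didn't"),
    ("2011-02-20", "On my way to school :)"),
    ("2011-02-21", "I hope there is still hope"),  # hypothetical, not in the question
]

# Occurrences per row = (characters removed by REPLACE) / (length of the word).
# LOWER makes the REPLACE case-insensitive for ASCII text.
QUERY = """
SELECT c._date, w.word,
       SUM((LENGTH(LOWER(c.text)) - LENGTH(REPLACE(LOWER(c.text), w.word, '')))
           / LENGTH(w.word)) AS n
FROM (SELECT 'hope' AS word UNION ALL SELECT 'afraid') AS w
CROSS JOIN content AS c
WHERE c.text LIKE '%' || w.word || '%'
GROUP BY c._date, w.word
ORDER BY c._date, w.word
"""


def count_occurrences(rows):
    """Load rows into an in-memory table and return (date, word, n) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
    conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in count_occurrences(ROWS):
        print(row)
```

Like the LIKE-based queries, this matches substrings, so "hopeful" would also be counted as "hope".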


To get 0 when there is no match, join the previous result with the dates again:

SELECT content._date, COALESCE(result.word, 'no match'), COALESCE(result.count, 0) 
FROM content 
LEFT JOIN (
SELECT _date, words.word, COUNT(*) as count 
FROM (
    SELECT 'hope' AS word 
    UNION 
    SELECT 'afraid' AS word 
) AS words 
CROSS JOIN content 
WHERE text LIKE CONCAT('%', words.word, '%') 
GROUP BY _date, words.word) AS result 
ON content._date = result._date 
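
As written, the LEFT JOIN query above returns one row per content row (so dates repeat) and labels a day without matches as 'no match' instead of listing each word with 0. A possible variation that yields exactly one row per date and word is to cross join the distinct dates with the word list first and only then LEFT JOIN the matching rows. This is a sketch under assumptions, not the answerer's query: it runs on Python's built-in sqlite3 with a schema guessed from the question, and in MySQL the pattern would be CONCAT('%', w.word, '%') rather than the || operator.

```python
import sqlite3

# Sample rows copied from the question; the schema below is an assumption.
ROWS = [
    ("2011-02-18", "I'm afraid my car won't start tomorrow"),
    ("2011-02-18", "I hope I'm going to pass my exams"),
    ("2011-02-18", "Exams coming up - I'm not afraid :P"),
    ("2011-02-19", "Not a single f was given this day"),
    ("2011-02-20", "I still hope I passed, but I'm afraid I didn't"),
    ("2011-02-20", "On my way to school :)"),
]

# Build every (date, word) pair first, then attach matching rows.
# COUNT(c._date) ignores the NULLs produced by unmatched pairs, giving 0.
QUERY = """
SELECT d._date, w.word, COUNT(c._date) AS n
FROM (SELECT DISTINCT _date FROM content) AS d
CROSS JOIN (SELECT 'hope' AS word UNION ALL SELECT 'afraid') AS w
LEFT JOIN content AS c
       ON c._date = d._date
      AND c.text LIKE '%' || w.word || '%'
GROUP BY d._date, w.word
ORDER BY d._date, w.word
"""


def counts_with_zeros(rows):
    """Return one (date, word, n) tuple per date and word, including zeros."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
    conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in counts_with_zeros(ROWS):
        print(row)
```

With the sample data this gives the six rows from the question's expected output, sorted by date and then by word.
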
+0

I did not downvote your answer, but I don't think this is what the OP asked for. Take care and reread the question. – danihp 2012-02-04 17:03:52

+0

@danihp: Where am I wrong? To me this looks like exactly what the OP wants. – knittl 2012-02-04 17:06:58

+0

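I posted my comment on the unedited answer; compare the edit time with the comment time. Also, you could include "words.word" in the GROUP BY clause to get a standard SQL query, and quote "text" because it is a reserved word. – danihp 2012-02-04 17:09:17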

2

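Assuming you want to count all words and find the most frequently used ones (rather than counting a few specific words), you might want to try something like the following stored procedure (credit for the string splitting goes to this blog post):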

DROP PROCEDURE IF EXISTS wordsUsed; 
DELIMITER // 
CREATE PROCEDURE wordsUsed() 
BEGIN 
    DROP TEMPORARY TABLE IF EXISTS wordTmp; 
    CREATE TEMPORARY TABLE wordTmp (word VARCHAR(255)); 

    SET @wordCt = 0; 
    SET @tokenCt = 1; 

    contentLoop: LOOP 
     SET @stmt = 'INSERT INTO wordTmp SELECT REPLACE(SUBSTRING(SUBSTRING_INDEX(`text`, " ", ?), 
           LENGTH(SUBSTRING_INDEX(`text`, " ", ? -1)) + 1), 
           " ", "") word 
        FROM content 
        WHERE LENGTH(SUBSTRING_INDEX(`text`, " ", ?)) != LENGTH(`text`)'; 
     PREPARE cmd FROM @stmt; 
     EXECUTE cmd USING @tokenCt, @tokenCt, @tokenCt; 
     SELECT ROW_COUNT() INTO @wordCt; 
     DEALLOCATE PREPARE cmd; 
     IF (@wordCt = 0) THEN 
      LEAVE contentLoop; 
     ELSE 
      SET @tokenCt = @tokenCt + 1; 
     END IF; 
    END LOOP; 

    SELECT word, count(*) usageCount FROM wordTmp GROUP BY word ORDER BY usageCount DESC; 
END // 
DELIMITER ; 

CALL wordsUsed(); 

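You may want to write another query (or procedure), or add some nested REPLACE calls to the statement that fills the temporary table, to strip punctuation as well, but this should be a good start.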
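
The nested REPLACE idea could look roughly like the sketch below. It is an illustration only: it runs on Python's built-in sqlite3 rather than MySQL, the tokens are hypothetical stand-ins for what the procedure would insert into wordTmp, and only three punctuation characters are stripped. REPLACE and LOWER behave the same way in MySQL for this ASCII input.

```python
import sqlite3

# Hypothetical tokens, standing in for rows the procedure inserts into wordTmp.
TOKENS = ["hope", "hope,", "Hope!", "afraid", "afraid."]

# Each REPLACE strips one punctuation character; LOWER folds case so that
# "Hope!" and "hope," end up in the same group.
QUERY = """
SELECT LOWER(REPLACE(REPLACE(REPLACE(word, ',', ''), '.', ''), '!', '')) AS clean_word,
       COUNT(*) AS usageCount
FROM wordTmp
GROUP BY clean_word
ORDER BY usageCount DESC, clean_word
"""


def clean_word_counts(tokens):
    """Return (clean_word, usageCount) tuples, most frequent first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wordTmp (word VARCHAR(255))")
    conn.executemany("INSERT INTO wordTmp VALUES (?)", [(t,) for t in tokens])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in clean_word_counts(TOKENS):
        print(row)
```
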
