2014-03-27

DSE 4.0.1: Hive count differs from Cassandra count

We are running Datastax Enterprise 4.0.1 and hitting a very strange problem: we insert rows into Cassandra and then run a count(1) query from Hive.

Setup: DSE 4.0.1, Cassandra 2.0, Hive, brand-new cluster. Insert 10,000 rows into Cassandra, then:

cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000; 

count 
------- 
10000 

(1 rows) 

cqlsh:pageviews> 

But from Hive:

hive> select count(1) from pageviews_v1 limit 100000; 
Total MapReduce jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002 
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002 
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1 
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0% 
<snip> 
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec 
MapReduce Total cumulative CPU time: 11 seconds 310 msec 
Ended Job = job_201403272330_0002 
MapReduce Jobs Launched: 
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS 
Total MapReduce CPU Time Spent: 11 seconds 310 msec 
OK 
1723 
Time taken: 38.634 seconds, Fetched: 1 row(s) 

Only 1723 rows. I'm confused. The CQL3 column family definition is:

CREATE TABLE pageviews_v1 (
    website text, 
    date text, 
    created timestamp, 
    browser_id text, 
    ip text, 
    referer text, 
    user_agent text, 
    PRIMARY KEY ((website, date), created, browser_id) 
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND 
    bloom_filter_fp_chance=0.001000 AND 
    caching='KEYS_ONLY' AND 
    comment='' AND 
    dclocal_read_repair_chance=0.000000 AND 
    gc_grace_seconds=864000 AND 
    index_interval=128 AND 
    read_repair_chance=1.000000 AND 
    replicate_on_write='true' AND 
    populate_io_cache_on_flush='false' AND 
    default_time_to_live=0 AND 
    speculative_retry='NONE' AND 
    memtable_flush_period_in_ms=0 AND 
    compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND 
    compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'}; 

And on the Hive side there is:

CREATE EXTERNAL TABLE pageviews_v1(
    website string COMMENT 'from deserializer', 
    date string COMMENT 'from deserializer', 
    created timestamp COMMENT 'from deserializer', 
    browser_id string COMMENT 'from deserializer', 
    ip string COMMENT 'from deserializer', 
    referer string COMMENT 'from deserializer', 
    user_agent string COMMENT 'from deserializer') 
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe' 
STORED BY 
    'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
WITH SERDEPROPERTIES (
    'serialization.format'='1', 
    'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua') 
LOCATION 
    'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1' 
TBLPROPERTIES (
    'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner', 
    'cassandra.ks.name'='pageviews', 
    'cassandra.cf.name'='pageviews_v1', 
    'auto_created'='true') 

Has anyone experienced anything similar?

Answers

The problem appears to be related to the CLUSTERING ORDER BY clause. Removing it resolves the issue of Hive returning a wrong COUNT.
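As a sketch of that workaround, this is the same table definition from the question with only the CLUSTERING ORDER BY clause (and the optional WITH properties) removed; the schema is otherwise unchanged:

```sql
-- Untested sketch: same table as in the question, minus CLUSTERING ORDER BY.
-- Rows will be stored in the default clustering order (created ASC).
CREATE TABLE pageviews_v1 (
    website text,
    date text,
    created timestamp,
    browser_id text,
    ip text,
    referer text,
    user_agent text,
    PRIMARY KEY ((website, date), created, browser_id)
);
```

Note the trade-off: without the DESC clustering order, queries that want the newest pageviews first must add ORDER BY created DESC explicitly.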

It could be the consistency setting on the Hive table, according to this document.
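If consistency is the cause, the fix would be to raise the read consistency level on the Hive external table. The property name below (cassandra.consistency.level) is an assumption based on DSE's Hive-to-Cassandra options and should be checked against the DSE documentation:

```sql
-- Hypothetical sketch: raise the read consistency for Hive scans of this
-- table. Property name and accepted values are assumptions; verify against
-- the DSE docs for your version.
ALTER TABLE pageviews_v1
    SET TBLPROPERTIES ('cassandra.consistency.level' = 'QUORUM');
```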

Change the Hive query to "select count(*) from pageviews_v1;".
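In other words, the suggested workaround is simply:

```sql
-- count(*) instead of count(1); semantically equivalent in HiveQL,
-- but reportedly avoids the miscount seen here.
select count(*) from pageviews_v1;
```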