Spark Structured Streaming with Python

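I am trying to get started with Spark Structured Streaming using Kafka and Python. Requirement: I need to process streaming data from Kafka (in JSON format) in Spark (performing transformations) and then store it in a database.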

My data is in JSON format, for example: {"a": 120.56, "b": 143.6865998138807, "name": "niks", "time": "2012-12-01 00:00:09"}
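Before choosing Spark types, the sample record can be inspected with the Python standard library alone. This is only a minimal sketch: the record is copied from the question, and the timestamp pattern `%Y-%m-%d %H:%M:%S` is an assumption read off that single sample.

```python
import json
from datetime import datetime

# Sample record copied from the question.
record = '{"a": 120.56, "b": 143.6865998138807, "name": "niks", "time": "2012-12-01 00:00:09"}'

parsed = json.loads(record)

# Python type of each field as decoded by the json module.
field_types = {key: type(value).__name__ for key, value in parsed.items()}

# "time" arrives as a plain string; the pattern below is assumed from the sample.
event_time = datetime.strptime(parsed["time"], "%Y-%m-%d %H:%M:%S")

print(field_types)
print(event_time.isoformat())
```

The two numeric fields decode as floats and `time` as a string in date-time form, which suggests double, double, string and timestamp as the corresponding Spark types.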

I intend to use spark.readStream to read from Kafka, like this:

data = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe","test").load()

I referred to this link, but could not work out how to parse the JSON data. I tried this:

data = data.selectExpr("CAST(a AS FLOAT)","CAST(b as FLOAT)", "CAST(name as STRING)", "CAST(time as STRING)").as[(Float, Float, String, String)]

But it does not seem to work.

Can anyone who has worked with Spark Structured Streaming in Python point me to an example or some links?

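Using the following,

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, TimestampType

# Expected layout of each JSON message
schema = StructType([
    StructField("a", DoubleType()),
    StructField("b", DoubleType()),
    StructField("name", StringType()),
    StructField("time", TimestampType())])

# Kafka delivers the payload as bytes in the "value" column, so cast it to a string before parsing
inData = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe","test").load()
data = inData.select(from_json(col("value").cast("string"), schema))
query = data.writeStream.outputMode("append").format("console").start()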

The program runs, but I get these values on the console:

+-----------------------------------+ 
|jsontostruct(CAST(value AS STRING))| 
+-----------------------------------+ 
|    [null,null,null,2...| 
|    [null,null,null,2...| 
+-----------------------------------+ 

17/04/07 19:23:15 INFO StreamExecution: Streaming query made progress: { 
    "id" : "8e2355cb-0fd3-4233-89d8-34a855256b1e", 
    "runId" : "9fc462e0-385a-4b05-97ed-8093dc6ef37b", 
    "name" : null, 
    "timestamp" : "2017-04-07T19:23:15.013Z", 
    "numInputRows" : 2, 
    "inputRowsPerSecond" : 125.0, 
    "processedRowsPerSecond" : 12.269938650306749, 
    "durationMs" : { 
    "addBatch" : 112, 
    "getBatch" : 8, 
    "getOffset" : 2, 
    "queryPlanning" : 4, 
    "triggerExecution" : 163, 
    "walCommit" : 26 
    }, 
    "eventTime" : { 
    "watermark" : "1970-01-01T00:00:00.000Z" 
    }, 
    "stateOperators" : [ ], 
    "sources" : [ { 
    "description" : "KafkaSource[Subscribe[test]]", 
    "startOffset" : { 
     "test" : { 
     "0" : 366 
     } 
    }, 
    "endOffset" : { 
     "test" : { 
     "0" : 368 
     } 
    }, 
    "numInputRows" : 2, 
    "inputRowsPerSecond" : 125.0, 
    "processedRowsPerSecond" : 12.269938650306749 
    } ], 
    "sink" : { 
    "description" : "[email protected]" 
    } 
}

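Am I missing something here?

With get_json_object, the fields can be read individually as strings:

from pyspark.sql.functions import col, get_json_object

data.select([
    get_json_object(col("value").cast("string"), "$.{}".format(c)).alias(c)
    for c in ["a", "b", "name", "time"]])

and cast afterwards according to your needs.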
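The extract-then-cast step can also be written as SQL expressions for `selectExpr`, which is close to what the question first attempted. The sketch below only builds the expression strings in plain Python. `FIELD_TYPES` and `json_field_exprs` are hypothetical names introduced here, and the target types are assumptions based on the sample record.

```python
# Spark SQL type to cast each JSON field to (assumed from the sample record).
FIELD_TYPES = {"a": "DOUBLE", "b": "DOUBLE", "name": "STRING", "time": "TIMESTAMP"}


def json_field_exprs(field_types, source="value"):
    """Build one selectExpr string per field: extract it from the JSON text, then cast it."""
    template = "CAST(get_json_object(CAST({src} AS STRING), '$.{name}') AS {typ}) AS `{name}`"
    return [
        template.format(src=source, name=name, typ=typ)
        for name, typ in field_types.items()
    ]


exprs = json_field_exprs(FIELD_TYPES)
for expr in exprs:
    print(expr)
```

Assuming `data` is the DataFrame returned by `load()`, the result would be passed on as `data.selectExpr(*exprs)`. The aliases are wrapped in backticks so that column names such as `time` are not mistaken for SQL keywords.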

Source

2017-04-07 user3150037

You can use from_json with a schema:

from pyspark.sql.functions import col, from_json 
from pyspark.sql.types import * 

schema = StructType([ 
    StructField("a", DoubleType()), 
    StructField("b", DoubleType()), 
    StructField("name", StringType()), 
    StructField("time", TimestampType())]) 

data.select(from_json(col("value").cast("string"), schema))

or get the individual fields as strings with get_json_object.
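from_json returns a single struct column, which is why the console output in the question shows one column named `jsontostruct(...)`. One possible way to get ordinary top-level columns is to alias the struct and expand it with `"parsed.*"`. In the sketch below, `parse_kafka_json` and `SCHEMA_DDL` are hypothetical names, and passing the schema as a DDL string assumes Spark 2.3 or later; on older versions the StructType from the answer can be passed instead.

```python
# Schema as a DDL string (assumption: Spark 2.3+ accepts this form in from_json).
SCHEMA_DDL = "a DOUBLE, b DOUBLE, name STRING, time TIMESTAMP"


def parse_kafka_json(df, schema=SCHEMA_DDL):
    """Parse the Kafka "value" column as JSON and return one column per field."""
    # Imported inside the function so this module loads even where pyspark is not installed.
    from pyspark.sql.functions import col, from_json

    # Kafka values are bytes: cast to string, parse into a struct, and name the struct.
    parsed = df.select(from_json(col("value").cast("string"), schema).alias("parsed"))

    # "parsed.*" expands the struct into the columns a, b, name and time.
    return parsed.select("parsed.*")
```

It would be called as `parse_kafka_json(inData)` on the streaming DataFrame from the question, and the result written out with `writeStream` as before.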

Source

2017-04-07 18:34:20 user6910411
