Partitioning a table

BigQuery currently only allows partitioning by date.

Let's suppose I have a table with 1 billion rows and an inserted_timestamp field. Let's say this field holds dates going back 1 year.

What is the right way to move the existing data into a new partitioned table?
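One way to picture the migration is as a day-by-day backfill: the rows for each calendar day of inserted_timestamp are selected and written into the matching daily partition. The sketch below only builds that plan as plain strings and does not call BigQuery. The table names `my_dataset.events` and `my_dataset.events_partitioned` are hypothetical, and it assumes the `table$YYYYMMDD` partition decorator and standard SQL.

```python
from datetime import date, timedelta


def daily_backfill_plan(source_table, dest_table, first_day, last_day):
    """Yield (destination, query) pairs, one per calendar day, inclusive."""
    day = first_day
    while day <= last_day:
        # Partition decorator: table name + "$" + the day as YYYYMMDD.
        destination = "{}${}".format(dest_table, day.strftime("%Y%m%d"))
        query = (
            "SELECT * FROM `{}` "
            "WHERE DATE(inserted_timestamp) = '{}'"
        ).format(source_table, day.isoformat())
        yield destination, query
        day += timedelta(days=1)


plan = list(daily_backfill_plan(
    "my_dataset.events",              # hypothetical source table
    "my_dataset.events_partitioned",  # hypothetical partitioned table
    date(2016, 10, 13),
    date(2017, 10, 13),
))
print(len(plan))
print(plan[0])
```

Each pair could then be submitted as a separate query job writing to the given destination; how those jobs are run is left open here.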

Edited

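I saw that there was an elegant solution for Java with version < 2.0, "Sharding BigQuery output tables", also elaborated in "BigQuery partitioning with Beam streams": parametrizing the table name (or partition suffix) by windowing the data.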
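The idea of parametrizing the table name by window can be sketched without Beam: a fixed window is identified by its start, and that start determines the partition suffix. This is a plain-Python illustration, not the Beam API. `window_start` and `table_for_window` are hypothetical helpers, `my_dataset.output_table` is a placeholder, and daily windows in UTC are an assumption.

```python
from datetime import datetime, timezone

DAY_SECONDS = 24 * 60 * 60


def window_start(timestamp, size=DAY_SECONDS):
    """Start (in Unix seconds) of the fixed window containing timestamp."""
    return timestamp - (timestamp % size)


def table_for_window(table, timestamp, size=DAY_SECONDS):
    """Table name with a partition suffix derived from the window start."""
    start = datetime.fromtimestamp(window_start(timestamp, size), tz=timezone.utc)
    return "{}${}".format(table, start.strftime("%Y%m%d"))


# Two events on the same UTC day fall into the same window and partition.
morning = datetime(2017, 10, 13, 8, 30, tzinfo=timezone.utc).timestamp()
evening = datetime(2017, 10, 13, 22, 15, tzinfo=timezone.utc).timestamp()
print(table_for_window("my_dataset.output_table", morning))
print(table_for_window("my_dataset.output_table", evening))
```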

However, I cannot find this in the Beam 2.x project, and there are no examples of getting the window time from a Python serializable function.

I tried to do the partitioning in the pipeline, but it fails with a large number of partitions (it runs with 100 but fails with 1000).

This is my code, as far as I got:

    import apache_beam as beam

    (p
     | 'lectura' >> beam.io.ReadFromText(input_table)
     # Keep only lines that start with a digit (drops the header row).
     | 'noheaders' >> beam.Filter(lambda s: s[0].isdigit())
     | 'addtimestamp' >> beam.ParDo(AddTimestampDoFn())
     | 'window' >> beam.WindowInto(beam.window.FixedWindows(60))
     | 'table2row' >> beam.Map(to_table_row)
     | 'write2table' >> beam.io.Write(beam.io.BigQuerySink(
           output_table,  # <-- unable to parametrize by window
           dataset=my_dataset,
           project=project,
           schema='dia:DATE, classe:STRING, cp:STRING, import:FLOAT',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
       ))
    )

    p.run()

Source

2017-10-13 danihp

https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables should be relevant; it covers several approaches. Also, I think you should be able to avoid flat files by using JSON or AVRO instead of CSV. –

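@NhanNguyen, I just edited my question to be more specific. An elegant solution existed in < 2.0 that I am missing in 2.x. Thanks for your link; I followed it and it is a very relevant question. Thanks again. – danihp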

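All of the functionality needed to do this exists in Beam, although it may currently be limited to the Java SDK. You can use BigQueryIO. Specifically, you can use DynamicDestinations to determine the destination table for each row.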

From the DynamicDestinations example:

    events.apply(BigQueryIO.<UserEvent>write()
        .to(new DynamicDestinations<UserEvent, String>() {
          // Destination key for an element: here, the user id.
          public String getDestination(ValueInSingleWindow<UserEvent> element) {
            return element.getValue().getUserId();
          }
          // Table that rows with this destination key are written to.
          public TableDestination getTable(String user) {
            return new TableDestination(tableForUser(user),
                "Table for user " + user);
          }
          // Schema used if the destination table has to be created.
          public TableSchema getSchema(String user) {
            return tableSchemaForUser(user);
          }
        })
        .withFormatFunction(new SerializableFunction<UserEvent, TableRow>() {
          public TableRow apply(UserEvent event) {
            return convertUserEventToTableRow(event);
          }
        }));
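For readers following along in Python, the contract in the Java example (derive a destination key per element, then resolve that key to a table) can be illustrated in plain Python. This is not a Beam API: `destination_for`, `table_for` and `route_rows` are hypothetical names, `my_dataset.output_table` is a placeholder, and the `dia` field is borrowed from the schema in the question.

```python
from collections import defaultdict


def destination_for(row):
    """Destination key for a row: its 'dia' date as YYYYMMDD."""
    return row["dia"].replace("-", "")


def table_for(destination, base_table="my_dataset.output_table"):
    """Resolve a destination key to a partition-suffixed table name."""
    return "{}${}".format(base_table, destination)


def route_rows(rows):
    """Group rows by the table each one should be written to."""
    routed = defaultdict(list)
    for row in rows:
        routed[table_for(destination_for(row))].append(row)
    return dict(routed)


rows = [
    {"dia": "2017-10-12", "classe": "a", "cp": "08001", "import": 10.5},
    {"dia": "2017-10-13", "classe": "b", "cp": "08002", "import": 3.0},
    {"dia": "2017-10-13", "classe": "a", "cp": "08003", "import": 7.25},
]
for table, group in sorted(route_rows(rows).items()):
    print(table, len(group))
```

The split into a key function and a table function mirrors getDestination and getTable above; the key is computed per row, while the table lookup depends only on the key.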

Source

2017-10-16 20:40:13

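Why isn't there a Python wrapper to do this? Should I use Java instead of Python for Dataflow projects? Do you know whether Google is putting its resources into Java? I mean, if I work with Python, will I miss out on more features than this one? Thanks! – danihp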

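As this demonstrates, there are feature differences between the Java and Python SDKs. Closing these gaps is part of Apache Beam's ongoing effort. This particular issue is tracked as [BEAM-2801](https://issues.apache.org/jira/browse/BEAM-2801). –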
