Google Dataflow: Insert + Update In Bigquery In A Streaming Pipeline
Solution 1:
Even if the row weren't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited to bulk mutations than to mutating individual entities via UPDATE like this. Your pattern is something I'd expect from a transactional rather than an analytical use case.
Consider an append-based pattern instead. Each time you process an entity message, write it to BigQuery via a streaming insert. Then, whenever needed, you can get the latest version of all entities via a query.
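As a minimal sketch of the write side, here is what the append-only streaming insert could look like in an Apache Beam (Python) pipeline. The Pub/Sub topic, table name, schema, and parse function are illustrative assumptions, not details from the question:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def parse_entity(message_bytes):
    # Assumes each Pub/Sub message is a JSON object with at least
    # idfield, message_time, and payload keys (illustrative schema).
    return json.loads(message_bytes.decode("utf-8"))


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadMessages" >> beam.io.ReadFromPubSub(
            topic="projects/myproject/topics/entity-updates")
        | "ParseEntity" >> beam.Map(parse_entity)
        # Append every message as a new row; never UPDATE existing rows.
        | "AppendToBigQuery" >> beam.io.WriteToBigQuery(
            table="myproject:mydata.mytable",
            schema="idfield:STRING,message_time:TIMESTAMP,payload:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )

Because every message simply becomes a new row, no UPDATE is ever issued and the streaming buffer never gets in the way.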
As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
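If you do want to materialize that latest-version result into another table, a sketch using the google-cloud-bigquery Python client could look like the following; the destination table name is an assumption:

from google.cloud import bigquery

client = bigquery.Client()

latest_sql = """
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(t ORDER BY message_time DESC LIMIT 1)[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
"""

# Overwrite the snapshot table with the latest version of every entity.
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydata.mytable_latest",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(latest_sql, job_config=job_config).result()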
An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points in time. To analyze the entities as they stood an hour ago, you would simply add a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
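Assembled into the full query and driven by a query parameter (a sketch; the parameter name and the one-hour cutoff are illustrative), the point-in-time version might look like:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

as_of_sql = """
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(t ORDER BY message_time DESC LIMIT 1)[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
WHERE message_time <= @as_of
GROUP BY idfield
"""

# Entities as they stood one hour ago.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("as_of", "TIMESTAMP", cutoff)]
)
rows = client.query(as_of_sql, job_config=job_config).result()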