Skip to content Skip to sidebar Skip to footer

Google Dataflow: Insert + Update In Bigquery In A Streaming Pipeline

The main object A python streaming pipeline in which I read the input from pub/sub. After the input is analyzed, two option are available: If x=1 -> insert If x=2 -> update

Solution 1:

Even if the row wasn't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited for bulk mutations rather than mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from an transactional rather than analytical use case.

Consider an append-based pattern for this. Each time you process an entity message write it to BigQuery via streaming insert. Then, when needed you can get the latest version of all entities via a query.

As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):

#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDERBY message_time DESC LIMIT 1
  )[OFFSET(0)].*EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUPBY idfield

An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points of time. To perform an analysis of the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)

Post a Comment for "Google Dataflow: Insert + Update In Bigquery In A Streaming Pipeline"