With this ETL, data is available for querying almost immediately and is stored in (close to) its final state within Hive tables. As an added bonus, because we are using Kite Datasets and the accompanying Flume sink, Hive partitions are handled for us automatically. However, due to the asynchronous nature of our event collection, we end up having to write to multiple partitions at the same time, which results in the formation of many small files. To get around this, we simply run a compaction job with Oozie after some date cutoff for events.

This process worked very well until a couple of days ago, when some network issues and some bugs caused OOM errors that resulted in our Flume agents sporadically losing all events buffered in memory. Fortunately, we persist our Kafka topic of important logs for about a month, so we can just replay it, then merge and dedupe the data in Hive. Not only that, but we have a lot of servers in Cloudera Manager, so we can just add our backfilling Flume agent role to a bunch of them and we'll be done in no time.

Because we must write to many partitions at once, sending events to the Kite Dataset sink with so many Flume agents caused our Hive Metastore to become unstable, not only limiting the rate of our ingestion but also causing numerous query failures for our growth explorers. Sure, we could have just slowed ingestion down, but being snake people, that just won't do. After all, we have the technology: we can make our ETL… Better… Stronger… Faster.
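The merge-and-dedupe step after a Kafka replay could be sketched as a Hive query along these lines. This is only a sketch under assumed names: the post does not show its schema, so the table names (`events`, `events_with_backfill`), columns (`event_id`, `ts`, `payload`), and the `dt` partition key are all hypothetical.

```sql
-- Hypothetical schema: events(event_id, ts, payload) PARTITIONED BY (dt).
-- events_with_backfill holds the union of the original and replayed data.
-- Keep one row per event_id, preferring the earliest copy seen.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT event_id, ts, payload, dt
FROM (
  SELECT event_id, ts, payload, dt,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ts) AS rn
  FROM events_with_backfill
) deduped
WHERE rn = 1;
```

A side effect of overwriting whole partitions like this is compaction: each affected partition is rewritten as a handful of large files rather than the many small ones the asynchronous writes produce.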