With this ETL, data is available for querying almost immediately and is stored in (close to) its final state within Hive tables. As an added bonus, because we are using Kite Datasets and the accompanying Flume sink, Hive partitions are handled for us automatically. However, due to the asynchronous nature of our event collection, we end up having to write to multiple partitions at the same time, which results in the formation of many small files. To get around this, we simply run a compaction job with Oozie after some date cutoff for events.

This process worked very well until a couple of days ago, when some network issues and some bugs caused OOM errors that resulted in our Flume agents sporadically losing all events buffered in memory. Fortunately, we persist our Kafka topic of important logs for about a month, so we can just replay it, then merge and dedupe the data in Hive. Not only that, but we have a lot of servers in Cloudera Manager, so we can just add our backfilling Flume agent role to a bunch of them and we'll be done in no time.

Because we must write to many partitions at once, sending events to the Kite Dataset sink with so many Flume agents caused our Hive Metastore to become unstable, not only limiting the rate of our ingestion but also causing numerous query failures for our growth explorers. Sure, we could have just slowed ingestion down, but being snake people, that just won't do. After all, we have the technology: we can make our ETL… Better… Stronger… Faster.
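The merge-and-dedupe step after a Kafka replay could be sketched as a Hive query along these lines. This is only a sketch under assumed names: the post does not show its schema, so the table names (`events`, `events_with_backfill`), columns (`event_id`, `ts`, `payload`), and the `dt` partition key are all hypothetical.

```sql
-- Hypothetical schema: events(event_id, ts, payload) PARTITIONED BY (dt).
-- events_with_backfill holds the union of the original and replayed data.
-- Keep one row per event_id, preferring the earliest copy seen.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT event_id, ts, payload, dt
FROM (
  SELECT event_id, ts, payload, dt,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ts) AS rn
  FROM events_with_backfill
) deduped
WHERE rn = 1;
```

A side effect of overwriting whole partitions like this is compaction: each affected partition is rewritten as a handful of large files rather than the many small ones the asynchronous writes produce.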