Posts

Apache Kafka®, Kafka Streams, and ksqlDB to demonstrate real use cases - 3 (Message ordering)

How to maintain message ordering with no message duplication

Question: How can I maintain the order of messages and prevent message duplication in a Kafka topic partition?

Example use case: If your application needs to maintain ordering of messages with no duplication, you can enable your Apache Kafka producer for idempotency. An idempotent producer has a unique producer ID and uses sequence numbers for each message, which allows the broker to ensure it is committing ordered messages with no duplication, on a per-partition basis.

Short Answer: set the ProducerConfig configuration parameters relevant to the idempotent producer:

enable.idempotence=true
acks=all

Initialize the project: mkdir message-ordering && cd message-ordering, then make the following directories to set up its structure: mkdir src test

2. Get Confluent Platform: next, create the following docker-compose.yml file to obtain Confluent Platform (this tutorial uses just ZooKeeper and the Kafka broker): ---version: '2...
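For reference, here is a minimal Java sketch of the idempotent producer configuration described above. The broker address and topic name are placeholders rather than values from the tutorial:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap server; replace with your broker address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // The two settings from the short answer above.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With idempotence enabled, retries cannot introduce duplicates
            // or reorder messages within a partition.
            producer.send(new ProducerRecord<>("example-topic", "key-1", "value-1"));
        }
    }
}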

Apache Kafka®, Kafka Streams, and ksqlDB to demonstrate real use cases - 2 (Primitive keys and values)

How to use the console consumer to read non-string primitive keys and values

Question: How do I specify key and value deserializers when running the Kafka console consumer?

Example use case: You want to inspect or debug records written to a topic. Each record key and value is a long and a double, respectively. In this tutorial you'll learn how to specify key and value deserializers with the console consumer.

Initialize the project: to get started, make a new directory anywhere you'd like for this project: mkdir console-consumer-primitive-keys-values && cd console-consumer-primitive-keys-values

2. Get Confluent Platform: next, create the following docker-compose.yml file to obtain Confluent Platform:

---
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.1.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  broker:
    image: co...
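As a sketch of where this tutorial ends up (the topic name and broker address here are placeholders), the consumer invocation uses Kafka's built-in LongDeserializer and DoubleDeserializer via the console consumer's --key-deserializer and --value-deserializer options:

kafka-console-consumer --topic example-topic --bootstrap-server broker:9092 \
  --from-beginning \
  --property print.key=true \
  --key-deserializer "org.apache.kafka.common.serialization.LongDeserializer" \
  --value-deserializer "org.apache.kafka.common.serialization.DoubleDeserializer"

Without these deserializers, the console consumer would try to print the raw long and double bytes as strings, producing unreadable output.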

Apache Kafka®, Kafka Streams, and ksqlDB to demonstrate real use cases-1 (Basic CLI commands)

Produce and Consume: Console Producer and Consumer Basics, no (de)serializers

Question: What is the simplest way to write messages to and read messages from Kafka?

Example use case: You are excited to get started with Kafka, you'd like to produce and consume some basic messages, and you want to do so quickly. In this tutorial we'll show you how to produce and consume messages from the command line, with no code!

Short Answer

Console producer:

kafka-console-producer --topic example-topic --bootstrap-server broker:9092 \
  --property parse.key=true \
  --property key.separator=":"

Console consumer:

kafka-console-consumer --topic example-topic --bootstrap-server broker:9092 \
  --from-beginning \
  --property print.key=true \
  --property key.separator="-"

Initialize the project: to get started, make a new directory anywhere you'd like for this project: mkdir console-consumer-producer-basic && cd console-consumer-producer-basic

2. Get Confluent Pla...
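To sketch what a session with these two commands looks like (the records below are made up for illustration): the producer splits each input line on the ":" separator into a key and a value, and the consumer, started with print.key=true and key.separator="-", prints each record back with the key and value joined by "-".

Producer input:

fruit:apple
fruit:banana

Consumer output:

fruit-apple
fruit-banana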

Apache KSQL (Kafka SQL Streaming) tutorial

KSQL is a SQL streaming engine for Apache Kafka. It provides an easy-to-use, yet powerful, interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language like Java or Python. KSQL is scalable, elastic, and fault-tolerant. It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization. We have been writing a lot of code so far to consume a stream of data from Kafka, using Kafka's Java client, Kafka Streams, or Spark. In this post, we will see how we can use KSQL to achieve similar results. KSQL lets you write SQL queries to analyze a stream of data in real time. Since a stream is an unbounded data set (for more details about this terminology, see Tyler Akidau's posts), a query in KSQL will keep generating results until you stop it. KSQL is built on top of Kafka Streams. When you submit a query, this query will be pa...
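As a small illustration of a continuous query (the stream name, topic name, and columns are hypothetical, not from this post):

-- Register an existing Kafka topic as a stream.
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- A continuous query: results keep updating until you stop it,
-- because the underlying stream is unbounded.
-- (On newer ksqlDB versions, append EMIT CHANGES to run it as a push query.)
SELECT page, COUNT(*) AS views
FROM pageviews
GROUP BY page;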

All about Broadcast Variables and how to use them

In this post, we will see how to use broadcast variables in Spark. Broadcast variables can be tricky if the concepts behind them are not clearly understood, which leads to errors when using them down the line. Broadcast variables are used to implement a map-side join, i.e. a join performed inside a map operation. For example, lookup tables or reference data are distributed across the nodes of a cluster using a broadcast, and are then used inside map to do the join implicitly. When you broadcast some data, the data is copied to all the executors only once (so we avoid copying the same data again and again for each task). Hence a broadcast makes your Spark application faster when you have a large value to use in tasks, or when there are more tasks than executors. To use broadcast variables correctly, note the points below and cross-check them against your usage. Broadcast type errors: a broadcast variable is not necessarily an RDD or a collection. It's just whatever type you as...
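A minimal Java sketch of the map-side join idea described above; the lookup table contents and RDD data are invented for illustration:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastJoinExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("broadcast-join").setMaster("local[*]");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Small lookup table: shipped to every executor exactly once.
            Map<String, String> countryNames = new HashMap<>();
            countryNames.put("IN", "India");
            countryNames.put("US", "United States");
            Broadcast<Map<String, String>> lookup = jsc.broadcast(countryNames);

            JavaRDD<String> codes = jsc.parallelize(Arrays.asList("IN", "US", "IN"));
            // The join happens implicitly inside map(), against the broadcast value.
            JavaRDD<String> names = codes.map(code -> lookup.value().getOrDefault(code, "unknown"));
            names.collect().forEach(System.out::println);
        }
    }
}

Note that the broadcast wraps a plain Map here, not an RDD: you read it on the executors through lookup.value(), which is what makes the per-record lookup cheap.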