Posts

Showing posts from July, 2021

Why are the wait and notify methods in the Object class instead of the Thread class?

wait and notify are not just ordinary methods or synchronization utilities; they are a communication mechanism between two threads in Java. Since this mechanism is not exposed through any Java keyword (the way mutual exclusion is exposed through synchronized), the Object class is the right place to make it available to every object. synchronized provides mutual exclusion and protects a Java class against problems like race conditions, while wait and notify are a communication channel between two threads. Locks are granted on a per-object basis, which is another reason wait and notify are declared in the Object class rather than the Thread class. In Java, to enter a critical section of code, threads need a lock and they wait for that lock; they don't know which thread holds it, only that it is held by some thread and that they should wait for the lock, instead of knowing which thread is inside the synchronized block and asking it to release the lock. This analogy fits with wait and notify being on the Object class rather than the Thread class.
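A minimal sketch of that mechanism in Scala (on the JVM, wait, notify, and synchronized are inherited from java.lang.Object, so the same rules apply); the lock object, the message variable, and the thread bodies are illustrative, not from the post.

object WaitNotifyDemo {
  private val lock = new Object                      // both threads coordinate through this shared object's monitor
  private var message: Option[String] = None

  def main(args: Array[String]): Unit = {
    val consumer = new Thread(() =>
      lock.synchronized {
        while (message.isEmpty) lock.wait()          // wait on the shared lock object, never on a Thread
        println(s"received: ${message.get}")
      }
    )
    val producer = new Thread(() =>
      lock.synchronized {
        message = Some("hello")
        lock.notify()                                // wakes a thread waiting on the same object
      }
    )
    consumer.start()
    Thread.sleep(100)                                // give the consumer time to start waiting
    producer.start()
    consumer.join()
    producer.join()
  }
}

The point the post makes is visible here: both threads call wait and notify on the shared lock object, not on each other's Thread instances.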

Kafka interview questions

How to Send Large Messages in Kafka? Kafka Issue: Many times, trying to send large messages over Kafka fails with an exception – "MessageSizeTooLargeException". This mostly occurs on the producer side. Fixes: There are a couple of configuration properties you can change and then check whether the error goes away.
Producer: increase max.request.size so the producer is allowed to send the larger message.
Broker: replica.fetch.max.bytes – the size used when replicating messages between brokers. The brokers obey this size as a rule when creating replicas of a message; if it is set too small, the message is never fully replicated and therefore never committed, so the consumer will not see it.
Broker: message.max.bytes – the maximum size of a message that a broker can receive from a producer. Try setting it higher and see whether the exception goes away.
Broker (per topic): max.message.bytes – the maximum size of a message accepted for that specific topic.
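A minimal sketch of the producer-side fix, using the kafka-clients API from Scala; the broker address, topic name, payload, and the 10 MB limit are illustrative values, not from the post.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, (10 * 1024 * 1024).toString)  // max.request.size: allow ~10 MB requests

val producer = new KafkaProducer[String, String](props)
val largePayload = "x" * (2 * 1024 * 1024)                                       // 2 MB dummy message for illustration
producer.send(new ProducerRecord("large-messages", "key", largePayload))
producer.close()

The broker-side properties (message.max.bytes, replica.fetch.max.bytes, and the per-topic max.message.bytes) are set in the broker or topic configuration rather than in producer code, so they are outside this snippet.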

Role of Tachyon in Spark architecture

Apache Spark is a general-purpose cluster computing system for processing big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the latter part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases. Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage. In MapReduce, memory is primarily used for actual computation; Spark uses memory both to compute and to store objects. Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different workloads, such as SQL, machine learning, and stream processing.
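Since the post is about Tachyon's role, here is a minimal sketch of where that in-memory storage shows up in code: persisting an RDD, including the off-heap storage level that early Spark releases (pre-1.6) served through Tachyon (now Alluxio). It assumes a running spark-shell (sc available); the HDFS paths are illustrative.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs")

// MEMORY_ONLY keeps deserialized objects in executor memory, so compute and storage share the same memory
logs.persist(StorageLevel.MEMORY_ONLY)
logs.count()                                   // the first action materializes and caches the RDD
logs.filter(_.contains("ERROR")).count()       // served from the cache instead of re-reading HDFS

// OFF_HEAP stores blocks outside the JVM heap; in pre-1.6 Spark this level was backed by Tachyon.
// In current Spark it requires spark.memory.offHeap.enabled and spark.memory.offHeap.size instead.
val events = sc.textFile("hdfs:///data/events").persist(StorageLevel.OFF_HEAP)
events.count()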

What is Apache Spark and its life cycle with components in detail


Getting started with Spark's word count program in Scala

Hadoop MapReduce's word count becomes very simple with the Spark shell. In this recipe, we are going to create a simple one-line text file, upload it to the Hadoop Distributed File System (HDFS), and use Spark to count occurrences of words. Let's see how:
1. Create the words directory by using the following command: $ mkdir words
2. Get into the words directory: $ cd words
3. Create a sh.txt text file and enter "to be or not to be" in it: $ echo "to be or not to be" > sh.txt
4. Start the Spark shell: $ spark-shell
5. Load the words directory as an RDD: scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
6. Count the number of lines (result: 1): scala> words.count
7. Divide the line (or lines) into multiple words: scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. Convert each word to (word, 1), that is, output 1 as the value with each word as the key: scala> val wordsMap = wordsFlatMap.map(w => (w, 1))
9. Use the reduceByKey method to add the counts for each occurrence of a word.
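A consolidated version of the same recipe, runnable in the Spark shell; it assumes the file created above has been copied to the HDFS path used in step 5.

scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))        // split lines into words
scala> val wordsMap = wordsFlatMap.map(w => (w, 1))             // each word becomes (word, 1)
scala> val wordCount = wordsMap.reduceByKey((a, b) => a + b)    // sum the 1s per word
scala> wordCount.collect().foreach(println)                     // e.g. (to,2), (be,2), (or,1), (not,1)

reduceByKey merges the values of each key pairwise with the given function, which is why (a, b) => a + b yields the total count per word.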

How does Spark decide the number of tasks and how many of them execute in parallel?

In this post we will see how Spark decides the number of tasks in a job and how many of them execute in parallel. Let's see how Spark decides on the number of tasks with the below set of instructions:
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
Let's also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
Stages and number of tasks per stage: Spark will create 3 stages.
First stage: instructions 1, 2 and 3
Second stage: instructions 4 and 5
Third stage: instructions 6, 7 and 8
Number of tasks in the first stage: the first stage reads dataset_X, and dataset_X has 10 partitions, so stage 1 will result in 10 tasks. If your dataset is very small, you might see Spark still create 2 tasks; this is because Spark looks at the defaultMinPartitions property, which decides the minimum number of tasks Spark can create. The default for defaultMinPartitions is 2.
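A sketch of the same pipeline in the Spark shell, only to make the stage boundaries concrete; the HDFS paths, the comma-split key extraction, and the partition counts are illustrative assumptions, not from the post.

// Read with explicit minimum partition counts to mirror the example (10 and 5)
val datasetX = sc.textFile("hdfs:///data/dataset_X", 10)
val datasetY = sc.textFile("hdfs:///data/dataset_Y", 5)

// Instructions 1-3 and 4-5 are narrow transformations, so each chain stays in the stage of its read
val xPairs = datasetX.filter(_.nonEmpty).map(line => (line.split(",")(0), line))
val yPairs = datasetY.map(line => (line.split(",")(0), line))

// Instruction 6: join requires a shuffle, so instructions 6-8 run in a new (third) stage
val joined = xPairs.join(yPairs).filter { case (_, (x, y)) => x != y }
joined.saveAsTextFile("hdfs:///data/output")

println(sc.defaultMinPartitions)     // lower bound on partitions for small inputs (2 by default)
println(datasetX.getNumPartitions)   // at least 10; the exact count depends on the input splits

Each stage's task count equals the number of partitions of the RDD it processes, and the shuffle introduced by the join is what closes the first two stages.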