Posts

Identify use cases from industries of your choice and elaborate on how Big Data analytics can transform those businesses.

Telecoms: The telecom industry worldwide finds itself in a highly complex environment of decreasing margins and congested networks; an environment that is as cutthroat as ever. A new IBM study on how telcos are using Big Data shows that 85% of respondents indicate that the use of information and analytics is creating a competitive advantage for them. Big Data initiatives promise to improve growth and increase efficiency and profitability across the entire telecom value chain. Yes, Big Data to the rescue again! The potential of Big Data, however, poses a challenge: how can a company utilize data to increase revenues and profits across the value chain, spanning network operations, product development, marketing, sales, and customer service? Big Data analytics, for instance, enables companies to predict peak network usage so that they can take measures to relieve congestion. It can also help identify customers who are most likely to have problems paying bills as well as t...

Spark SQL Built-in Standard Functions

Spark SQL String Functions

String functions are grouped as “string_funcs” in Spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.

ascii(e: Column): Column - Computes the numeric value of the first character of the string column, and returns the result as an int column.
base64(e: Column): Column - Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
concat_ws(sep: String, exprs: Column*): Column - Concatenates multiple input string columns together into a single string column, using the given separator.
decode(value: Column, charset: String): Column - Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(value: Column, charset: St...
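As a quick sketch of how a few of these functions are used together (the DataFrame and column names below are invented purely for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("string-funcs-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data with two string columns.
val df = Seq(("John", "Smith"), ("Ada", "Lovelace")).toDF("first", "last")

df.select(
  ascii($"first").as("ascii_first"),             // numeric value of the first character
  concat_ws(" ", $"first", $"last").as("full"),  // concatenate columns with a separator
  base64(encode($"first", "UTF-8")).as("b64")    // string -> binary -> BASE64 string
).show(false)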

What is garbage collection in Spark, and what are its impact and resolution?

Garbage Collection

Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java’s memory management and garbage collection (GC). GC can therefore be a major issue affecting many Spark applications. Common symptoms of excessive GC in Spark are:

- Slowness of the application
- Executor heartbeat timeouts
- “GC overhead limit exceeded” errors

Spark’s memory-centric approach and data-intensive workloads make GC a more common issue here than in other Java applications. Thankfully, it is easy to diagnose whether your Spark application is suffering from a GC problem: the Spark UI marks executors in red if they have spent too much time doing GC. Whether executors are spending a significant amount of CPU cycles on garbage collection can be determined by looking at the “Executors” tab in the Spark application UI. Spark will mark an executor in red if the executor has spent more than 10% of the time in gar...
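A minimal sketch of how GC behaviour is typically surfaced and mitigated; the memory size is a placeholder to tune per workload, and -verbose:gc and -XX:+UseG1GC are standard JVM flags passed through spark.executor.extraJavaOptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gc-diagnosis-sketch")
  // Emit GC activity in the executor logs so long pauses become visible.
  .config("spark.executor.extraJavaOptions", "-verbose:gc -XX:+UseG1GC")
  // Undersized heaps force frequent GC; give executors realistic headroom.
  .config("spark.executor.memory", "4g")
  .getOrCreate()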

What is data skew in Spark, and how do we deal with it?

What is Data Skew?

In an ideal Spark application run, when Spark wants to perform a join, for example, the join keys would be evenly distributed and each partition would get a nicely organized share of the data to process. However, real business data is rarely so neat and cooperative. We often end up with a less than ideal data organization across the Spark cluster, which results in degraded performance due to data skew.

Data skew is not an issue with Spark per se; rather, it is a data problem. The cause of data skew is the uneven distribution of the underlying data. Uneven partitioning is sometimes unavoidable given the overall data layout or the nature of the query. For joins and aggregations Spark needs to co-locate records of a single key in a single partition. Records of a given key will always be in a single partition; similarly, records of other keys will be distributed across other partitions. If a single partition becomes very large it will cause data skew, which will be problematic for any query engine if...
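One common mitigation is key salting, sketched below for two assumed DataFrames ordersDf and customersDf joined on customer_id (all names and the bucket count are hypothetical); on Spark 3.x, adaptive query execution can also split skewed partitions via spark.sql.adaptive.skewJoin.enabled:

import org.apache.spark.sql.functions._

val saltBuckets = 16

// Spread hot keys on the skewed (large) side across saltBuckets sub-keys.
val saltedOrders = ordersDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the small side across every salt value so each salted key still matches.
val saltedCustomers = customersDf
  .withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on the original key plus the salt, then drop the salt afterwards.
val joined = saltedOrders
  .join(saltedCustomers, Seq("customer_id", "salt"))
  .drop("salt")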

Role of the Spark optimizer: a deep dive into the Catalyst optimizer

The Catalyst optimizer is a crucial component of Apache Spark. It optimizes structural queries – expressed in SQL, or via the DataFrame/Dataset APIs – which can reduce the runtime of programs and save costs. Developers often treat Catalyst as a black box that just magically works, and only a handful of resources explain its inner workings in an accessible manner. When discussing Catalyst, many presentations and articles reference the architecture diagram included in the original Databricks blog post that introduced Catalyst to a wider audience. That diagram depicts the physical planning phase inadequately, and it is ambiguous about which kinds of optimizations are applied at which point. The following sections explain Catalyst’s logic, the optimizations it performs, its most crucial constructs, and the plans it generates. In particular, the scope of cost-based optimizations is outlined. These were advertised as “state-of-the-art” when the f...
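A minimal way to watch Catalyst at work is to ask Spark for the plans it produces: explain(true) prints the parsed, analyzed, and optimized logical plans along with the selected physical plan (the tiny DataFrame below is invented for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-plans").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "label")

// Catalyst rewrites this query: predicates may be pushed down, constants folded, etc.
df.filter($"id" > 1)
  .groupBy($"label")
  .count()
  .explain(true)   // parsed -> analyzed -> optimized logical plan -> physical plan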

Spark challenges faced in the current project

Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast.

Some challenges occur at the job level; these challenges are shared right across the data team. They include:

1. How many executors should each job use?
2. How much memory should I allocate for each job?
3. How do I find and eliminate data skew?
4. How do I make my pipelines work better?
5. How do I know if a specific job is optimized?

Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

6. How do I size my nodes, and match them to the right servers/instance types?
7. How do I se...
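As a hedged sketch of the first two questions above (the numbers are placeholders to be tuned per workload, not recommendations), per-job sizing is usually expressed through a handful of standard settings:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("per-job-sizing-sketch")
  .config("spark.executor.instances", "10")        // how many executors the job requests
  .config("spark.executor.cores", "4")             // cores per executor
  .config("spark.executor.memory", "8g")           // heap per executor
  .config("spark.executor.memoryOverhead", "1g")   // off-heap headroom per executor
  .config("spark.driver.memory", "4g")             // driver-side memory
  .getOrCreate()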