Hive vs Pig

Pig | Hive
Procedural data flow language | Declarative SQL-like language
Used for programming | Used for creating reports
Mainly used by researchers and programmers | Mainly used by data analysts
Operates on the client side of a cluster | Operates on the server side of a cluster
Does not have a dedicated metadata database | Uses a variant of SQL DDL: tables are defined beforehand and their schemas are kept in a metastore
Pig Latin resembles SQL in places but differs from it substantially | Directly leverages SQL and is easy to learn for database experts
Supports the Avro file format | Early releases did not support Avro; later releases read and write it through the AvroSerDe
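The first row of the comparison (procedural data flow versus declarative query) can be illustrated with a minimal sketch. It uses plain Python and the standard-library sqlite3 module as stand-ins; it is neither Pig Latin nor HiveQL, and the purchases data, column names, and threshold are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (buyer, category, amount)
rows = [
    ("alice", "books", 12.0),
    ("bob", "books", 8.0),
    ("alice", "games", 30.0),
    ("carol", "games", 5.0),
]

# Procedural data-flow style (the Pig column): each step is spelled out
# and feeds the next one.
filtered = [r for r in rows if r[2] >= 8.0]            # step 1: filter
grouped = defaultdict(list)                            # step 2: group
for buyer, category, amount in filtered:
    grouped[category].append(amount)
procedural_totals = {c: sum(a) for c, a in grouped.items()}  # step 3: aggregate

# Declarative style (the Hive column): the table is defined beforehand
# with DDL, then one query states the result and the engine plans the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (buyer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
declarative_totals = dict(conn.execute(
    "SELECT category, SUM(amount) FROM purchases "
    "WHERE amount >= 8.0 GROUP BY category"
))
conn.close()

print(procedural_totals)
print(declarative_totals)
```

Both approaches compute the same per-category totals; the difference is whether the author or the engine decides the order of the steps.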

Hadoop YARN

What is YARN?

YARN stands for Yet Another Resource Negotiator. It is a general-purpose platform for managing and scheduling resources across a cluster. YARN was introduced with Hadoop 2.0, the open-source distributed processing framework from Apache.


Apache Spark

What is Spark?

Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. It is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.


Machine Learning Methods

Supervised learning is the machine learning task of inferring a function from labeled training data.[1] The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.
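As a minimal sketch of this idea, the snippet below uses a 1-nearest-neighbor classifier, chosen here only because it is short; it is one of many supervised algorithms. The function name fit_nearest_neighbor, the training pairs, and the labels are all hypothetical. It requires Python 3.8 or later for math.dist.

```python
import math

def fit_nearest_neighbor(examples):
    """Take labeled (input vector, output label) pairs and return an
    inferred function that maps a new input to a predicted label."""
    stored = list(examples)

    def inferred(x):
        # Predict the label of the stored example closest to x
        # (Euclidean distance).
        _, label = min(stored, key=lambda pair: math.dist(pair[0], x))
        return label

    return inferred

# Hypothetical labeled training data: input vector paired with desired output.
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

classify = fit_nearest_neighbor(training_data)

# Unseen instances: neither point appears in the training data.
print(classify((2.0, 1.5)))
print(classify((7.0, 9.0)))
```

Here "generalizing" simply means assuming that nearby inputs share a label, which is one possible notion of a reasonable extension to unseen inputs.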
