1) Define Big Data. What are the five V’s of Big Data?
Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store it or process it efficiently.
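The five V’s of Big Data are:
- Volume: the amount of data
- Velocity: the speed at which data is generated and processed
- Variety: the different types of data (structured, semi-structured, unstructured)
- Veracity: the trustworthiness and quality of the data
- Value: the usefulness of the data to the business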
Big Data Example:
Take social media as an example.
Statistics indicate that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, posted comments and so forth.
2) What is Hadoop and what are its components?
Apache Hadoop emerged as a solution to the Big Data problem. It is a framework that provides various services and tools to store and process Big Data. It helps in analyzing Big Data and making business decisions from it, which cannot be done efficiently and effectively using traditional systems. Its main components are:
- Storage unit – HDFS (NameNode, DataNode)
- Processing framework – YARN (ResourceManager, NodeManager)
3) What are HDFS and YARN?
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master/slave topology.
- NameNode: The NameNode is the master node in the distributed environment. It maintains the metadata for the blocks of data stored in HDFS, such as block locations and replication factors.
- DataNode: DataNodes are the slave nodes responsible for storing data in HDFS. The NameNode manages all the DataNodes.
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop. It manages resources and provides an execution environment for the processes.
- ResourceManager: It receives the processing requests and passes the parts of each request to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
- NodeManager: A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.
4) What are the various Hadoop daemons and their roles in a Hadoop cluster?
First, the HDFS daemons:
- NameNode
- DataNode
- Secondary NameNode
Then the YARN daemons:
- ResourceManager
- NodeManager
And finally:
- JobHistoryServer
NameNode: The master node responsible for storing the metadata of all files and directories. It holds information about the blocks that make up a file and where those blocks are located in the cluster.
DataNode: The slave node that holds the actual data.
Secondary NameNode: It periodically merges the changes (the edit log) with the FsImage (Filesystem Image) present in the NameNode. It stores the modified FsImage in persistent storage, which can be used if the NameNode fails.
ResourceManager: The central authority that manages resources and schedules applications running on top of YARN.
NodeManager: It runs on slave machines and is responsible for launching the application’s containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
JobHistoryServer: It is responsible for maintaining information about MapReduce jobs after the Application Master terminates.
5) What are active and passive “NameNodes”?
In the HA (High Availability) architecture, there are two NameNodes: the active “NameNode” and the passive “NameNode”.
- The active “NameNode” is the “NameNode” that works and runs in the cluster.
- The passive “NameNode” is a standby “NameNode” that holds the same data as the active “NameNode”.
The passive “NameNode” replaces the active “NameNode” in the cluster when the active “NameNode” fails. Hence, the cluster is never without a “NameNode”, and a “NameNode” failure does not bring the cluster down.
6) What are the modes that Hadoop can be run in?
Hadoop can run in 3 modes:
- Standalone Mode: This is the default mode of Hadoop, in which it uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, no custom configuration is required in the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
- Pseudo-Distributed Mode (Single-Node Cluster): In this mode, you need to configure all three files mentioned above. All daemons run on one node, so the Master and Slave nodes are the same machine.
- Fully Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop, in which data is distributed across several nodes of a Hadoop cluster. Separate nodes are allocated as Master and Slave.
7) Mention the most common Input Formats in Hadoop?
The three most common Input Formats in Hadoop are as follows:
- Text Input Format: This is the default input format in Hadoop.
- Key Value Input Format: This format is used for plain text files in which the files are broken into lines.
- Sequence File Input Format: This format is used for reading files in sequence.
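The difference between the first two formats can be illustrated with a minimal Python sketch. It is not Hadoop code: the function names are hypothetical, and it only mimics the record shapes, assuming that the text format yields (byte offset, line) pairs and that the key-value format splits each line at the first tab, which is the usual default separator.

```python
def text_input_records(data: str):
    """Mimic the text input format: key = byte offset of the line, value = the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def key_value_records(data: str, separator: str = "\t"):
    """Mimic the key-value input format: split each line at the first separator."""
    for line in data.splitlines():
        # If the separator is absent, the whole line becomes the key
        # and the value is empty.
        key, _, value = line.partition(separator)
        yield key, value


sample = "apple\t3\nbanana\t5\n"
text_records = list(text_input_records(sample))
kv_records = list(key_value_records(sample))
print(text_records)
print(kv_records)
```

The same two lines of input therefore reach the mapper with different keys depending on the chosen input format.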
8) What are the methods of a Reducer?
The 3 core methods of a Reducer are:
- setup(): This method is used to configure various parameters such as the input data size and the distributed cache.
- public void setup(context)
- reduce(): This method is called once per key within the associated reduce task.
- public void reduce(Key, Value, context)
- cleanup(): This method is called only once, at the end of the task, to clean up temporary files.
- public void cleanup(context)
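The calling order of the three methods can be sketched in plain Python. This is an illustration of the lifecycle only, not the Hadoop API: `SumReducer`, `run_reduce_task` and the dictionary used as a context are hypothetical stand-ins.

```python
from collections import defaultdict


class SumReducer:
    """Hypothetical reducer that sums the values of each key."""

    def setup(self, context):
        # Called once, before any key is processed.
        context["calls"] = ["setup"]

    def reduce(self, key, values, context):
        # Called once per key, with all values grouped under that key.
        context["calls"].append(f"reduce:{key}")
        context["output"][key] = sum(values)

    def cleanup(self, context):
        # Called once, after the last key has been processed.
        context["calls"].append("cleanup")


def run_reduce_task(reducer, pairs):
    """Group (key, value) pairs by key and drive the reducer lifecycle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    context = {"output": {}}
    reducer.setup(context)
    for key in sorted(grouped):  # keys reach the reducer in sorted order
        reducer.reduce(key, grouped[key], context)
    reducer.cleanup(context)
    return context


result = run_reduce_task(SumReducer(), [("b", 1), ("a", 2), ("b", 3)])
print(result["calls"])
print(result["output"])
```

Here `setup` and `cleanup` each run exactly once per task, while `reduce` runs once for every distinct key.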
9) Name a few companies that use Hadoop?
Yahoo (one of the biggest users, and the contributor of more than 80% of the Hadoop code)
10) What is the port number for NameNode, Task Tracker and Job Tracker?
The following are the default web UI port numbers for the NameNode, Task Tracker and Job Tracker in Hadoop 1:
- NameNode: The port number for NameNode is 50070
- Job Tracker: The port number for Job Tracker is 50030
- Task Tracker: The port number for Task Tracker is 50060
11) Whenever a client submits a Hadoop job, who receives it?
The NameNode receives the Hadoop job, then looks for the data requested by the client and provides the block information. The Job Tracker takes care of resource allocation for the Hadoop job to ensure timely completion.
12) Explain about the different catalog tables in HBase?
There are two different catalog tables in HBase. They are:
ROOT table: It tracks where the META table is.
META table: It stores all the regions in the system.
13) What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion. They are as follows:
- Family Delete Marker: This type of marker marks all columns of a column family.
- Version Delete Marker: This type of marker marks a single version of a column.
- Column Delete Marker: This type of marker marks all the versions of a column.
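The scope of each marker type can be sketched with a small Python model. It is not HBase code: the tuple layouts and the `visible_cells` helper are hypothetical, and the timestamp rules (family and column markers hide cells at or before the marker's timestamp, a version marker hides only the exact timestamp) are stated here as assumptions of the sketch.

```python
def visible_cells(cells, markers):
    """Return the cells that are not hidden by any tombstone marker.

    cells:   list of (family, qualifier, timestamp, value)
    markers: list of (kind, family, qualifier, timestamp), where kind is
             "family", "column" or "version" (qualifier is None for "family")
    """
    def masked(family, qualifier, timestamp):
        for kind, m_family, m_qualifier, m_timestamp in markers:
            if m_family != family:
                continue
            if kind == "family" and timestamp <= m_timestamp:
                return True  # hides every column of the family
            if kind == "column" and m_qualifier == qualifier and timestamp <= m_timestamp:
                return True  # hides all versions of one column
            if kind == "version" and m_qualifier == qualifier and timestamp == m_timestamp:
                return True  # hides a single version of one column
        return False

    return [cell for cell in cells if not masked(cell[0], cell[1], cell[2])]


cells = [
    ("cf", "a", 1, "a1"), ("cf", "a", 2, "a2"),
    ("cf", "b", 1, "b1"), ("cf", "b", 2, "b2"),
]
after_version = visible_cells(cells, [("version", "cf", "a", 1)])
after_column = visible_cells(cells, [("column", "cf", "b", 2)])
after_family = visible_cells(cells, [("family", "cf", None, 2)])
print(after_version, after_column, after_family, sep="\n")
```

The markers only hide cells from reads; in this model nothing is physically removed, which matches the idea of a tombstone.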
14) Differentiate between Structured and Unstructured data?
Structured Data: The Data which is stored in traditional database systems in the form of rows and columns, for example, the online purchase transactions can be referred to as Structured Data.
Semi-Structured Data: Data which can be stored only partially in traditional database systems, for example, data in XML records can be referred to as semi-structured data.
Unstructured Data: Unorganized and raw data that cannot be categorized as semi-structured or structured data is referred to as unstructured data.
Examples of Unstructured data: Facebook updates, Tweets on Twitter, Reviews, weblogs, etc.
15) What are the main components of a Hadoop Application?
Hadoop applications have a wide range of technologies that provide a great advantage in solving complex business problems.
Core components of a Hadoop application are:
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
The Data Access Components are - Pig and Hive
The Data Storage Component is - HBase
The Data Integration Components are - Apache Flume, Sqoop, Chukwa
The Data Management and Monitoring Components are - Ambari, Oozie, and Zookeeper.
The Data Serialization Components are - Thrift and Avro
The Data Intelligence Components are - Apache Mahout and Drill.
16) What are the steps involved in deploying a big data solution?
There are three steps involved in deploying a big data solution:
- i) Data Ingestion – The first step in deploying a big data solution is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images and social media feeds. This data needs to be stored in HDFS. Data can be ingested either through batch jobs that run every 15 minutes, once every night and so on, or through real-time streaming at intervals from 100 ms to 120 seconds.
- ii) Data Storage – The next step after ingesting data is to store it either in HDFS or in a NoSQL database like HBase. HBase storage works well for random read/write access, while HDFS is optimized for sequential access.
- iii) Data Processing – The final step is to process the data using one of the processing frameworks like MapReduce, Spark, Pig, Hive and so forth.
17) How do you define “block” in HDFS? What is the default block size of Hadoop 1 and in Hadoop 2? Can it be changed?
- A block is the smallest continuous location on the hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are chopped into block-sized chunks, which are stored as independent units.
- Default Block size of Hadoop 1: 64MB
- Default Block size of Hadoop 2: 128MB
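Yes, the block size can be changed by setting the dfs.blocksize property (dfs.block.size in Hadoop 1) in the hdfs-site.xml file.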
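As a worked example of the arithmetic, the following Python sketch splits a file size into block-sized chunks. The helper name `split_into_blocks` and the 300 MB file are illustrative assumptions; the point is that only the last block may be smaller than the configured block size.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


hadoop2_blocks = split_into_blocks(300 * MB)           # 128 MB default
hadoop1_blocks = split_into_blocks(300 * MB, 64 * MB)  # 64 MB default
print([size // MB for size in hadoop2_blocks])
print([size // MB for size in hadoop1_blocks])
```

A 300 MB file thus occupies three blocks with the Hadoop 2 default and five blocks with the Hadoop 1 default.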
18) How do you define “Rack Awareness” in Hadoop?
Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, to minimize network traffic between “DataNodes” within the same rack. Consider the default replication factor of 3: the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
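The placement rule can be sketched as a small Python function. This is a simplified illustration, not the NameNode's implementation: `place_replicas` and the topology dictionary are hypothetical, the sketch assumes the first replica stays on the writer's node while the other two go to a single remote rack, and it picks nodes deterministically where a real cluster would also weigh load and free space.

```python
def place_replicas(writer_node, topology):
    """Choose three nodes so that two replicas share one rack and one is elsewhere.

    topology: dict mapping rack name -> list of node names (assumed layout).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Pick the first remote rack that can hold two replicas.
    # Raises StopIteration if no such rack exists.
    remote_rack = next(
        rack for rack, nodes in sorted(topology.items())
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
print(replicas)
```

With this layout, losing an entire rack still leaves at least one replica of the block available.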