1. What is data cleansing?
Data cleansing, also called data cleaning, improves the quality of data by identifying errors and inconsistencies in it and then correcting or removing them.
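As a rough illustration, the sketch below cleans a handful of records using only the Python standard library. The records, the field names (`name`, `city`, `age`) and the 0 to 120 age range are assumptions made up for this example, not part of any particular tool.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"name": " Alice ", "city": "new york", "age": "34"},
    {"name": "alice", "city": "New York", "age": "34"},  # duplicate once normalized
    {"name": "Bob", "city": "Boston", "age": ""},         # missing age
    {"name": "Carol", "city": "boston", "age": "-5"},     # illegal age
]

def clean(records):
    """Normalize text fields, blank out bad ages, and drop duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize whitespace and capitalization so equal values compare equal.
        name = rec["name"].strip().title()
        city = rec["city"].strip().title()
        age_text = rec["age"].strip()
        # Parse the age; anything that is not an integer becomes None (missing).
        age = int(age_text) if age_text.lstrip("-").isdigit() else None
        # Assumed valid range for this example: 0 to 120.
        if age is not None and not (0 <= age <= 120):
            age = None
        key = (name, city, age)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned

cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

Whether a bad value should be removed, corrected, or only flagged depends on the analysis; here illegal and missing ages are simply set to `None`.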
2. Data cleaning has a major role in data analysis, explain.
Cleaning data that comes from many different sources and arranging it in a format that a data scientist or data analyst can use easily is a difficult process. As the volume of data and the number of data sources grow, the time needed to clean the data grows as well. Because cleaning takes up the largest share of the time, it has become a major part of any analysis task.
3. Which among the two would you prefer for text analysis, Python or R?
Python is generally preferred because of the pandas library, which provides easy-to-use data structures. For ad hoc analysis and exploring databases, however, R is often the better fit.
4. What is logistic regression?
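It is a statistical method for analyzing a data set in which a categorical outcome, most often a binary one, is determined by one or more independent variables. It estimates the probability of the outcome rather than predicting a continuous value.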
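A minimal sketch of the idea, assuming a single independent variable and a binary outcome: the model P(y = 1 | x) = sigmoid(w*x + b) is fitted by plain gradient descent on the log loss. The data (hours studied versus pass/fail), the learning rate, and the number of epochs are made-up values for illustration; in practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)   # estimated pass probability after 1 hour
p_high = sigmoid(w * 4.0 + b)  # estimated pass probability after 4 hours
print(f"w={w:.3f}, b={b:.3f}, P(pass|1h)={p_low:.2f}, P(pass|4h)={p_high:.2f}")
```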
5. Differentiate between Univariate, Bivariate and Multivariate analysis.
These are techniques for statistical analysis, distinguished by the number of variables involved at one time.
If the analysis involves only one variable, for example a pie chart of sales, it is called univariate analysis.
If the analysis examines the relationship between two variables, for example in a scatter plot, it is called bivariate analysis. Analyzing sales against spend is one example.
If the analysis involves more than two variables, it is called multivariate analysis.
6. Explain the steps in making a decision tree.
- First, take the entire data set as input.
- Look for a split (one that divides the given data into two sets) that maximizes the separation of the classes.
- Apply the split to divide the input data.
- Re-apply the previous two steps (finding a split and dividing the data) to each resulting subset.
- Stop when a stopping criterion is met.
- If the tree has been split too many times, clean it up by removing branches. This step is called pruning.
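The steps above can be sketched in a few dozen lines. This is only an illustration under stated assumptions: splits are chosen by lowest weighted Gini impurity (one common criterion among several), features are numeric, a `max_depth` cap stands in for both the stopping criteria and pruning, and the tiny data set is made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split the data; a leaf is the majority label."""
    split = best_split(rows, labels)
    # Stopping criteria: depth limit reached, or no split improves impurity
    # (which includes the case where the node is already pure).
    if depth >= max_depth or split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical two-feature data set with two classes.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (6.0, 1.0), (7.0, 3.0), (8.0, 2.0)]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (2.5, 6.0)), predict(tree, (7.5, 1.5)))
```

Real pruning inspects a fully grown tree and removes branches that do not help on held-out data; the depth cap here is a simpler stand-in.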
7. Compare SAS, R, and Python.
SAS: One of the analytical tools widely used by large companies. It offers strong statistical functions and a GUI, but its price limits its use by small companies.
R: R addresses the main drawback of SAS: it is an open-source tool, which is likely why it is widely used in academia and the research community. R is mostly used for statistical computation, reporting, and graphical representation. Because it is open source, updates tend to reach users quickly.
Python: Python is also an open-source programming language and one of the easier languages to learn. It integrates readily with most other tools and technologies, and it is a robust language with a large number of community-built libraries and modules.
8. Difference between data mining and data profiling
Data profiling: It focuses on individual attributes and provides information about each one, such as its distinct values and their frequencies, data type, length, and value range.
Data mining: It focuses on detecting unusual records, cluster analysis, and dependencies or relationships between several different attributes.
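A small sketch of what profiling a single attribute might look like, using only the standard library. The function name `profile_column`, the summary keys, and the sample column are assumptions for this example; it also assumes the non-missing values are mutually comparable so that `min` and `max` work.

```python
from collections import Counter

def profile_column(values):
    """Summarize one attribute: counts, types, frequent values, range, length."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "max_length": max((len(str(v)) for v in present), default=None),
    }

# Hypothetical "age" column with one missing value.
ages = [34, 29, None, 34, 41]
age_profile = profile_column(ages)
for key, value in age_profile.items():
    print(f"{key}: {value}")
```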
9. How are statistics used by data scientists?
Statistics help data scientists identify hidden patterns and insights and turn big data into insights about customer behavior and expectations. With them, data scientists can study everything from customer behavior to customer conversion and build data models for inference and prediction. This in turn helps businesses give customers what they want, when they want it.
10. What are some of the common problems faced by data analysts?
The following are some of the most common problems a data analyst faces.
- Common misspellings
- Duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data
11. What is the goal of A/B Testing?
A/B testing is a statistical experiment that compares two variants, A and B, of the same item (for example, two versions of a webpage) to identify which one performs better. For example, when running a banner ad, it can be used to compare the click-through rates of two versions of the ad.
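One common way to decide whether the difference between A and B is more than chance is a two-proportion z-test. The sketch below is one possible approach, not the only one; the click and view counts are made up, and the 0.05 significance level is an assumed convention.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare two click-through rates; return (rate_a, rate_b, z, p_value)."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Hypothetical banner-ad results: clicks out of impressions for each variant.
rate_a, rate_b, z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"CTR A={rate_a:.3%}, CTR B={rate_b:.3%}, z={z:.2f}, p={p_value:.4f}")
if p_value < 0.05:  # assumed significance level
    print("The difference is statistically significant at the 5% level.")
```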
12. What are the different data validation methods that data analysts use?
Data analysts usually use two methods for data validation:
- Data screening
- Data verification
13. Explain the hierarchical clustering algorithm.
A hierarchical clustering algorithm builds a hierarchy by repeatedly merging or dividing existing groups. The resulting structure shows the order in which the groups were merged or divided.
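As an illustration of the merging (agglomerative) variant, the sketch below starts with every point in its own group and repeatedly merges the two closest groups, recording the merge order. Single linkage (distance between the closest members) and one-dimensional points are assumptions chosen to keep the example short; the data are made up.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters and the list of merges in the order they
    happened, which is the hierarchy the algorithm builds.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters, merges

# Hypothetical one-dimensional data.
points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters, merges = agglomerative(points, num_clusters=2)
for left, right, distance in merges:
    print(f"merged {left} and {right} at distance {distance:.2f}")
print("final clusters:", clusters)
```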
14. Various steps in an analytics project
- First, understand exactly what the business problem is.
- Then explore the given data and become familiar with it.
- Next, prepare the data for modeling.
- Run the model, analyze the result, and iterate until the best outcome is achieved.
- Validate the model using a new data set.
- Finally, implement the model, track the results, and analyze the model's performance.
15. What is Linear Regression?
It describes the relationship between a dependent variable and one or more independent variables. It is mostly used for predictive analysis, for example of sales or prices, where it predicts values in a continuous range rather than classifying them into categories.
A linear regression analysis typically involves the following three steps.
- Determining and analyzing the direction and correlation of the data
- Estimating the model
- Making sure the model is useful and valid
It is often used to estimate how much effect a factor has on an outcome. For instance, with linear regression we can estimate how a certain action relates to intermediate outcomes and to the final outcome, although a fitted relationship alone does not prove that the action caused them.
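A minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The spend and sales figures are made up for illustration, and the function name `fit_line` is an assumption for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(spend, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for a spend of 6.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_sales:.2f}")
```

The sign of the slope gives the direction of the relationship; checking the residuals or a goodness-of-fit measure would be the usual next step for judging validity.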
16. What is Normal Distribution?
The normal distribution is a continuous probability distribution in which the values of a continuous variable are spread along a bell-shaped curve, the normal curve. It is very useful in statistics and in analyzing variables and their relationships.
The curve is symmetrical around the mean. By the Central Limit Theorem, as the sample size increases, the distribution of the sample mean approaches a normal distribution even when the underlying data are not normally distributed.
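The effect of sample size can be checked with a small simulation. The sketch below draws samples from a skewed (exponential) distribution and measures the skewness of the sample means, which should move toward 0, the value for a symmetric normal curve, as the sample size grows. The choice of distribution, the sample sizes, and the seed are assumptions for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, num_samples=2000):
    """Means of num_samples samples drawn from an exponential distribution."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Average cubed z-score; 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

means_n1 = sample_means(1)    # sample size 1: just the raw skewed data
means_n50 = sample_means(50)  # sample size 50: means are much closer to normal

skew_n1 = skewness(means_n1)
skew_n50 = skewness(means_n50)
print(f"skewness with n=1:  {skew_n1:.2f}")
print(f"skewness with n=50: {skew_n50:.2f}")
```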
17. What is clustering? What are the properties of clustering algorithms?
Clustering is a method of grouping data that divides a data set into clusters, or natural groups, of similar items.
Properties of clustering algorithms include:
- Hierarchical or flat
- Hard or soft
18. What is Machine Learning?
Machine learning is a field of artificial intelligence (AI) in which systems are given the ability to learn automatically from data and make decisions with very little human intervention.
19. What is a hash table?
A hash table is a data structure used to implement an associative array, that is, a map of keys to values. It uses a hash function to compute an index into an array of slots, from which the desired value can be stored or fetched.
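A minimal sketch of the idea, assuming collisions are handled by separate chaining (each slot holds a small list of key-value pairs). The class name, method names, and the fixed number of slots are choices made for this example; Python's built-in `dict` is the practical choice in real code.

```python
class HashTable:
    """Toy hash table using separate chaining; no resizing."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, key):
        # The hash function maps the key to one of the slots.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("apple", 3)
table.put("banana", 7)
table.put("apple", 5)  # replaces the earlier value for "apple"
print(table.get("apple"), table.get("banana"), table.get("cherry", 0))
```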
20. Explain the types of bias that can occur during sampling.
- Selection bias
- Undercoverage bias
- Survivorship bias