1. What is the role of a data analyst?
Answer: A data analyst collects, processes, and performs statistical analyses on large datasets to discover insights and support decision-making processes.
2. How do data analysts differ from data scientists?
3. What is data cleaning and why is it important?
Answer: Data cleaning involves correcting or removing incorrect, incomplete, duplicate, or irrelevant data. It’s essential because dirty data can lead to incorrect analysis and poor decisions.
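As a rough illustration, the sketch below applies these cleaning steps with pandas to a small made-up table. The column names (customer_id, age) and the rule that a plausible age lies between 0 and 120 are assumptions chosen for the example, not a general standard.

```python
import pandas as pd

# Hypothetical raw data containing a duplicate row, a non-numeric age,
# and an implausible age.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": ["34", "29", "29", "unknown", "210"],
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Convert age to a number; values that cannot be parsed become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Keep only rows with a plausible age (the 0-120 range is an assumption).
# NaN values fail the check, so unparseable ages are dropped as well.
clean = clean[clean["age"].between(0, 120)].reset_index(drop=True)

print(clean)
```

Whether to drop or repair a bad value depends on the dataset; dropping is used here only to keep the example short.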
4. What is the difference between descriptive and predictive analysis?
Answer: Descriptive analysis and predictive analysis are two distinct approaches to data analysis.
Descriptive analysis answers questions such as “What has happened in the past?” and “What are the key characteristics of the data?” Its primary objective is to find correlations, patterns, and trends in the data. To learn more about the dataset, it uses statistical metrics, visualizations, and exploratory data analysis methods. The following are the main features of descriptive analysis:
Predictive analysis, on the other hand, uses statistical and machine learning models to analyze historical data, find patterns and relationships, and forecast future events. Its main objective is to anticipate what is likely to occur.
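To make the contrast concrete, here is a minimal sketch on made-up monthly sales figures. The descriptive part summarizes the history; the predictive part fits a straight-line trend with NumPy and extrapolates one month ahead. The data and the choice of a linear trend are assumptions for illustration; real forecasting usually calls for a properly validated model.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales figures (made up for this example).
sales = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 150.0],
                  index=range(1, 7), name="sales")

# Descriptive analysis: summarize what has already happened.
average_sales = sales.mean()
total_growth = sales.iloc[-1] - sales.iloc[0]

# Predictive analysis: fit a straight-line trend to the history and
# extrapolate it one month ahead.
slope, intercept = np.polyfit(sales.index.to_numpy(), sales.to_numpy(), deg=1)
forecast_month_7 = slope * 7 + intercept

print(f"Average monthly sales: {average_sales:.1f}")
print(f"Growth over the period: {total_growth:.1f}")
print(f"Forecast for month 7: {forecast_month_7:.1f}")
```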
The following are the main attributes of predictive analysis:
5. What are univariate, bivariate, and multivariate analysis?
Answer:
1. Univariate Analysis
Definition:
Univariate analysis examines a single variable at a time. The purpose is to understand the distribution, central tendency (mean, median, mode), and spread (range, variance, standard deviation) of the variable.
Example:
Analyzing the average age of customers in a dataset.
Python Example:
df['Age'].describe()
Common visualizations:
Histogram
Box plot
Pie chart (for categorical data)
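The individual measures of central tendency and spread can also be computed one at a time. The following is a small sketch on a made-up Age series; the values are illustrative only.

```python
import pandas as pd

# Hypothetical customer ages (made up for this example).
ages = pd.Series([22, 25, 25, 30, 38, 40], name="Age")

# Central tendency
mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]   # mode() returns a Series; take the first value

# Spread
age_range = ages.max() - ages.min()
variance = ages.var()            # sample variance (ddof=1 by default)
std_dev = ages.std()             # sample standard deviation

print(mean_age, median_age, mode_age, age_range, variance, std_dev)
```

Note that pandas computes the sample variance and standard deviation by default; pass ddof=0 for the population versions.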
2. Bivariate Analysis
Definition:
Bivariate analysis examines the relationship between two variables. It helps identify correlations, dependencies, or patterns.
Example:
Studying the relationship between hours studied and exam scores.
Common methods:
Scatter plot
Correlation coefficient (e.g., Pearson, Spearman)
Cross-tabulations
Box plot grouped by category
Python Example:
df.plot.scatter(x='Hours_Studied', y='Exam_Score')
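A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson and Spearman coefficients with pandas on made-up data, reusing the Hours_Studied and Exam_Score column names from the example above.

```python
import pandas as pd

# Hypothetical study data (made up for this example).
df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Exam_Score": [52, 58, 61, 70, 90],
})

# corr() on a DataFrame returns the full correlation matrix;
# pick out the single pair of interest.
pearson = df.corr(method="pearson").loc["Hours_Studied", "Exam_Score"]
spearman = df.corr(method="spearman").loc["Hours_Studied", "Exam_Score"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship. Here the scores rise with every extra hour, so the Spearman coefficient is 1 even though the relationship is not perfectly linear.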
3. Multivariate Analysis
Definition:
Multivariate analysis involves analyzing three or more variables simultaneously to understand relationships, interactions, and influence among them.
Example:
Understanding how age, income, and education level together influence spending habits.
Common techniques:
Multiple regression
Principal Component Analysis (PCA)
Cluster analysis
Heatmaps
3D plots or pairplots (in seaborn)
Python Example:
import seaborn as sns
sns.pairplot(df[['Age', 'Income', 'Spending_Score']])
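A numeric companion to the pairplot is the correlation matrix, which is also the table a heatmap would typically display. The sketch below uses only pandas; the data is made up and the column names follow the example above.

```python
import pandas as pd

# Hypothetical customer data (made up for this example).
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Income": [30000, 42000, 58000, 61000, 72000],
    "Spending_Score": [80, 65, 50, 45, 30],
})

# Pairwise Pearson correlations between all three variables.
corr_matrix = df[["Age", "Income", "Spending_Score"]].corr()

print(corr_matrix.round(2))
```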
6. What is the importance of exploratory data analysis (EDA) in data analysis?
Answer:
Exploratory data analysis (EDA) is the process of examining and understanding data using statistical and graphical methods. It is one of the most important stages of data analysis because it helps reveal trends and patterns in the data and shows how variables relate to one another.
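A typical first pass might look like the sketch below. The dataset and its column names (Age, City, Income) are made up for illustration, and the exact steps vary from project to project.

```python
import pandas as pd

# Hypothetical dataset (made up for this example).
df = pd.DataFrame({
    "Age": [23, 35, None, 41, 29],
    "City": ["Pune", "Delhi", "Pune", None, "Pune"],
    "Income": [32000, 54000, 47000, 61000, 39000],
})

# Size and column types.
n_rows, n_cols = df.shape
print(df.dtypes)

# How much data is missing in each column?
missing = df.isna().sum()
print(missing)

# Summary statistics for the numeric columns.
print(df.describe())

# Frequency of each category in a text column.
city_counts = df["City"].value_counts()
print(city_counts)

# How the numeric variables relate to one another.
print(df[["Age", "Income"]].corr())
```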
Because EDA is a non-parametric approach to data analysis, it makes few prior assumptions about the dataset. EDA is significant for several reasons, including the following:
7. How can pandas be used for data analysis?
Answer:
Pandas is a powerful open-source Python library that provides fast, flexible, and expressive data structures like Series and DataFrame for working with structured data.
Pandas is essential for:
Cleaning messy data
Filtering and transforming datasets
Performing descriptive statistics
Analyzing time-series data
Handling large datasets efficiently
You can load data from various formats:
In Python:
import pandas as pd
# CSV
df = pd.read_csv('data.csv')
# Excel
df = pd.read_excel('data.xlsx')
# JSON
df = pd.read_json('data.json')
Inspect the first and last few rows and summarize the dataset:
In Python:
df.head() # First 5 rows
df.tail() # Last 5 rows
df.info() # Summary of dataset
df.describe() # Descriptive statistics
Handling missing values
df.isnull().sum() # Count missing values per column
df.dropna() # Returns a copy without rows that have missing values
df.fillna(0) # Returns a copy with missing values replaced by 0
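Two refinements are often useful: restricting dropna() to the columns that matter, and filling with a statistic such as the median rather than a constant. The sketch below shows both on a made-up table; the column names are hypothetical, and the median is only one of several reasonable fill strategies.

```python
import pandas as pd

# Hypothetical data with gaps (made up for this example).
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina"],
    "Age": [28.0, None, 35.0, 31.0],
    "Salary": [50000.0, 62000.0, None, 58000.0],
})

# dropna() and fillna() return new objects; df itself is unchanged
# unless the result is assigned back.
complete_rows = df.dropna()                 # rows with no missing values
has_age = df.dropna(subset=["Age"])         # only require Age to be present

# Fill each numeric column with its own median instead of a constant 0.
filled = df.fillna({"Age": df["Age"].median(),
                    "Salary": df["Salary"].median()})

print(filled)
```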
Removing duplicates
df.drop_duplicates(inplace=True)
Renaming columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
Selecting columns
df['ColumnName']
df[['Col1', 'Col2']]
Filtering rows
df[df['Age'] > 30]
df[(df['Age'] > 25) & (df['Gender'] == 'Male')]
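Beyond & (AND), a few other filtering patterns come up often. The following sketch uses a made-up table with hypothetical Name, Age, and Gender columns.

```python
import pandas as pd

# Hypothetical data (made up for this example).
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina"],
    "Age": [28, 45, 35, 22],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# OR: either condition may hold. Each condition needs its own parentheses.
young_or_male = df[(df["Age"] < 25) | (df["Gender"] == "Male")]

# NOT: invert a mask with ~.
not_male = df[~(df["Gender"] == "Male")]

# isin(): match any value from a list.
selected = df[df["Name"].isin(["Asha", "Chen"])]

# loc: filter rows and choose a column in one step.
names_over_30 = df.loc[df["Age"] > 30, "Name"]

print(names_over_30.tolist())
```

The parentheses matter because & and | bind more tightly than comparison operators in Python; without them the expression is evaluated in the wrong order and usually raises an error.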
Grouping and aggregating (useful for summarizing large datasets):
df.groupby('Department')['Salary'].mean()
df.groupby(['Department', 'Gender'])['Salary'].sum()
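When several statistics are needed at once, agg() with named aggregation computes them in a single pass. This is a small sketch on made-up employee data; the output column names (avg_salary, max_salary, headcount) are chosen for the example.

```python
import pandas as pd

# Hypothetical employee data (made up for this example).
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "IT"],
    "Salary": [40000, 50000, 60000, 70000, 80000],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Salary", "count"),
)

print(summary)
```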
Sorting values
df.sort_values(by='Salary', ascending=False)
Setting index
df.set_index('EmployeeID', inplace=True)
You can create or modify columns:
df['Total'] = df['Price'] * df['Quantity']
df['Category'] = df['Sales'].apply(lambda x: 'High' if x > 1000 else 'Low')
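apply() with a lambda runs Python code row by row, which can be slow on large data. A vectorized alternative, sketched below on made-up numbers, is np.where for a two-way rule and pd.cut for more than two bands. The bin edges and band labels are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data (made up for this example).
df = pd.DataFrame({"Sales": [250, 1000, 1500, 4200]})

# Vectorized equivalent of the apply/lambda rule: 'High' above 1000, else 'Low'.
df["Category"] = np.where(df["Sales"] > 1000, "High", "Low")

# More than two bands: pd.cut assigns each value to a labeled interval.
# Intervals are closed on the right by default, e.g. (500, 2000].
df["Band"] = pd.cut(
    df["Sales"],
    bins=[0, 500, 2000, np.inf],
    labels=["Small", "Medium", "Large"],
)

print(df)
```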
Combining multiple datasets:
pd.merge(df1, df2, on='ID', how='inner') # inner, left, right, outer
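The how argument decides which keys survive the merge. The sketch below compares three of the options on two tiny made-up tables that share an ID column; the Name and Salary columns are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing an 'ID' key (made up for this example).
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Asha", "Ben", "Chen"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Salary": [62000, 58000, 71000]})

inner = pd.merge(df1, df2, on="ID", how="inner")   # IDs present in both tables
left = pd.merge(df1, df2, on="ID", how="left")     # every row of df1
outer = pd.merge(df1, df2, on="ID", how="outer",   # every ID from either table
                 indicator=True)                   # adds a '_merge' source column

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side are kept with NaN in the missing columns, and the _merge column added by indicator=True records whether each row came from the left table, the right table, or both.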
Pivot tables and cross-tabulations:
df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum')
pd.crosstab(df['Gender'], df['Purchased'])
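The two functions answer different questions: pivot_table aggregates a value column, while crosstab counts how often combinations occur. A minimal sketch on made-up sales records:

```python
import pandas as pd

# Hypothetical sales records (made up for this example).
df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Product": ["A", "B", "A", "A", "B"],
    "Sales": [100, 150, 200, 50, 300],
})

# pivot_table aggregates a value column: total Sales per Region and Product.
totals = df.pivot_table(index="Region", columns="Product",
                        values="Sales", aggfunc="sum")

# crosstab counts occurrences: number of records per Region and Product.
counts = pd.crosstab(df["Region"], df["Product"])

print(totals)
print(counts)
```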
Pandas supports simple plotting using Matplotlib:
df['Sales'].plot(kind='line')
df['Category'].value_counts().plot(kind='bar')
df.plot.scatter(x='Profit', y='Sales')