Lecture Notes of Class 7: Data Manipulation with dplyr
Objective:
In this class, we will explore data manipulation using the dplyr package in R.
dplyr is part of the tidyverse and provides a consistent set of functions for
working with data frames: filtering rows, selecting columns, arranging rows,
mutating columns, and summarizing data. We'll also learn how to chain multiple
operations using the pipe operator (%>%), which comes from the magrittr package
and is re-exported by dplyr.
Topics Covered:
1. Using filter()
The filter() function allows you to subset data based on certain conditions. It
helps to filter rows that meet a specific condition, making the data analysis
process easier.
Syntax:
filter(data, condition)
Example: Suppose we have a data frame students:
library(dplyr)  # load dplyr before calling its functions

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)
To filter students who are 22 years old:
filter(students, age == 22)
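The example above uses a single condition. As a minimal sketch (the result names older_a, young_or_c, and a_or_b are illustrative and not part of the original notes), filter() can also combine conditions with AND, OR, and set membership:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# AND: comma-separated conditions must all be TRUE for a row to be kept
older_a <- filter(students, age > 21, grade == "A")

# OR: use | when either condition is enough
young_or_c <- filter(students, age < 22 | grade == "C")

# Set membership: %in% keeps rows whose value appears in the vector
a_or_b <- filter(students, grade %in% c("A", "B"))
```

With this data, older_a contains only Alice, young_or_c contains Charlie and David, and a_or_b contains every student except David.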
2. Using select()
The select() function allows you to choose specific columns from the data
frame. It is useful when you only want to focus on a subset of variables.
Syntax:
select(data, columns)
Example: To select only the name and age columns from the students data frame:
select(students, name, age)
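Beyond listing column names, select() accepts a few other forms. The sketch below is illustrative only (the result names and the new column name student are assumptions for this example); it shows dropping a column, using the starts_with() helper, and renaming while selecting:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Drop a column by prefixing it with a minus sign
no_grade <- select(students, -grade)

# Keep columns whose names start with "a" (here only age)
a_cols <- select(students, starts_with("a"))

# Rename a column while selecting it: new_name = old_name
renamed <- select(students, student = name, age)
```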
3. Using arrange()
The arrange() function allows you to reorder the rows of a data frame based on
one or more variables. It can be used for sorting data in ascending or
descending order.
Syntax:
arrange(data, column)
arrange(data, desc(column)) # For descending order
Example: To arrange the students by age in ascending order:
arrange(students, age)
4. Using mutate()
The mutate() function is used to create new variables or modify existing
variables. It is very useful when you want to apply transformations to your
data.
Syntax:
mutate(data, new_column = expression)
Example: To create a new column called age_in_months by multiplying the age column by 12:
mutate(students, age_in_months = age * 12)
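mutate() can also overwrite an existing column or build a column from a condition. The following is a small sketch under assumed names (top_grade, with_flag, and older are hypothetical), using dplyr's if_else() for the conditional case:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Conditional column: "yes" for grade A, "no" otherwise
with_flag <- mutate(students, top_grade = if_else(grade == "A", "yes", "no"))

# Reusing an existing column name overwrites that column
older <- mutate(students, age = age + 1)
```

Note that mutate() returns a new data frame; students itself is unchanged unless the result is assigned back to it.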
5. Using summarize() (or summarise())
The summarize() function is used to create summary statistics of your data. It
is particularly useful for calculating aggregated data, such as the mean,
median, sum, etc.
Syntax:
summarize(data, new_column = summary_function(column))
Example: To calculate the average age of the students:
summarize(students, average_age = mean(age))
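A single summarize() call can compute several statistics at once. This is a minimal sketch; the output column names (n_students, oldest, youngest) and the result name stats are chosen for illustration:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# One output row with one column per summary statistic
stats <- summarize(
  students,
  n_students  = n(),        # number of rows
  average_age = mean(age),
  oldest      = max(age),
  youngest    = min(age)
)
```

For this data the result has one row: 5 students, an average age of 22.4, a maximum of 24, and a minimum of 21.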
6. Chaining Operations with Pipes (%>%)
The pipe operator (%>%) chains multiple operations together, making the code
more readable. It passes the result of the expression on its left as the first
argument to the function on its right, so x %>% f(y) is equivalent to f(x, y).
Syntax:
data %>% function1() %>% function2()
Example: To filter students with age 22, select the name column, and arrange
the result in alphabetical order:
students %>%
  filter(age == 22) %>%
  select(name) %>%
  arrange(name)
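To see what the pipe is doing, it can help to compare a piped chain with the equivalent nested calls. The sketch below (nested and piped are illustrative names) builds the same result both ways:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Nested form: must be read from the inside out
nested <- arrange(select(filter(students, age == 22), name), name)

# Piped form: reads top to bottom in the order the steps run
piped <- students %>%
  filter(age == 22) %>%
  select(name) %>%
  arrange(name)

# Both forms perform the same sequence of calls
same_result <- identical(nested, piped)
```

Since R 4.1.0, base R also provides a native pipe, |>, which works the same way for simple chains like this one.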
7. Aggregating Data with summarize()
Aggregating data means calculating summary statistics for groups of data. The summarize()
function can be used in combination with group_by() to calculate statistics for
specific groups within your data.
Syntax:
group_by(data, group_column) %>%
  summarize(new_column = summary_function(column))
Example: To calculate the average age for each grade in the students data frame:
students %>%
  group_by(grade) %>%
  summarize(average_age = mean(age))
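Grouped summaries often include a group size alongside the statistic. As a hedged sketch (the column names n_students and oldest and the result name by_grade are assumptions for this example):

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# One output row per grade, with several statistics per group
by_grade <- students %>%
  group_by(grade) %>%
  summarize(
    n_students  = n(),       # rows in each group
    average_age = mean(age),
    oldest      = max(age)
  )
```

For this data, grade A has 2 students averaging 21.5, grade B has 2 averaging 22.5, and grade C has 1 student aged 24.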
Practical Exercises:
1. Exercise 1: Filter and Select Specific Columns from Data
- Use the students data frame from the previous examples.
- Filter students whose age is greater than 22.
- Select the name and grade columns.
Solution:
students %>%
  filter(age > 22) %>%
  select(name, grade)
2. Exercise 2: Perform Data Aggregation Using summarize()
- Group the students data by grade and calculate the average age for each group.
Solution:
students %>%
  group_by(grade) %>%
  summarize(average_age = mean(age))
Conclusion:
In this class, we introduced the basics of data manipulation using the dplyr
package. We learned how to filter, select, arrange, mutate, and summarize data,
and how to aggregate data by group using group_by() together with summarize().
The pipe operator (%>%) chains multiple operations together to streamline the
code. By applying these functions, you can manipulate and analyze your data in R.
These concepts are essential for any data analysis task and will serve as the
foundation for more advanced data manipulation techniques.
Assignments: Data Manipulation with dplyr (Step-by-Step Solutions)
Assignment 1: Filter Data Based on Multiple Conditions
Problem: Using the students data frame, filter students who are older than 21
years and have a grade of "A".
Step-by-Step Solution:
1. Load the data:
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)
2. Use the filter() function: We use filter() to subset data based on multiple
conditions. Conditions separated by commas are combined with AND, so a row is
kept only if both age > 21 and grade == "A" hold.
filter(students, age > 21, grade == "A")
3. Expected Output:
   name age grade
1 Alice  22     A
Charlie also has grade "A" but is 21, so the age > 21 condition excludes him.
Assignment 2: Select Specific Columns
Problem: Select only the name and grade columns from the students data frame.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use the select() function: To choose specific columns, use the select() function.
select(students, name, grade)
3. Expected Output:
     name grade
1   Alice     A
2     Bob     B
3 Charlie     A
4   David     C
5     Eva     B
Assignment 3: Arrange Data by Age
Problem: Arrange the students data frame by age in descending order.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use the arrange() function with desc() for descending order:
arrange(students, desc(age))
3. Expected Output:
     name age grade
1   David  24     C
2     Bob  23     B
3   Alice  22     A
4     Eva  22     B
5 Charlie  21     A
Assignment 4: Create a New Column
Problem: Create a new column age_in_months in the students data frame, which is
calculated as age * 12.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use the mutate() function to create the new column:
mutate(students, age_in_months = age * 12)
3. Expected Output:
     name age grade age_in_months
1   Alice  22     A           264
2     Bob  23     B           276
3 Charlie  21     A           252
4   David  24     C           288
5     Eva  22     B           264
Assignment 5: Summarize Data (Mean of Age)
Problem: Calculate the average age of the students using the summarize() function.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use the summarize() function to calculate the mean age:
summarize(students, average_age = mean(age))
3. Expected Output:
  average_age
1        22.4
Assignment 6: Filter and Select Data Using Pipe (%>%)
Problem: Using the students data frame, filter students older than 21, select
only the name and grade columns, and arrange them in ascending order by name.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Chain operations using pipes:
students %>%
  filter(age > 21) %>%
  select(name, grade) %>%
  arrange(name)
3. Expected Output:
   name grade
1 Alice     A
2   Bob     B
3 David     C
4   Eva     B
Only Charlie (age 21) is removed by the age > 21 filter.
Assignment 7: Group Data and Summarize (Average Age by Grade)
Problem: Group the students data by grade and calculate the average age for each grade.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use group_by() to group data by grade and summarize() to calculate the average age:
students %>%
  group_by(grade) %>%
  summarize(average_age = mean(age))
3. Expected Output:
  grade average_age
1 A            21.5
2 B            22.5
3 C            24.0
Grade B contains Bob (23) and Eva (22), so its average is (23 + 22) / 2 = 22.5.
Assignment 8: Filter and Mutate Data
Problem: Filter students with grade "A", then create a new column age_in_days
(calculated as age * 365).
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use filter() to get students with grade "A" and then mutate() to create the
new column:
students %>%
  filter(grade == "A") %>%
  mutate(age_in_days = age * 365)
3. Expected Output:
     name age grade age_in_days
1   Alice  22     A        8030
2 Charlie  21     A        7665
Assignment 9: Use arrange() with Multiple Variables
Problem: Arrange the students data first by grade in ascending order, and then
by age in descending order.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Use arrange() to sort by grade and age:
arrange(students, grade, desc(age))
3. Expected Output:
     name age grade
1   Alice  22     A
2 Charlie  21     A
3     Bob  23     B
4     Eva  22     B
5   David  24     C
Within each grade the older student comes first, so Alice (22) precedes Charlie (21).
Assignment 10: Filter, Select, and Summarize in One Step
Problem: Filter students older than 21, select name and age, and calculate the
average age of those students.
Step-by-Step Solution:
1. Load the data (same as Assignment 1).
2. Chain filter(), select(), and summarize() using pipes:
students %>%
  filter(age > 21) %>%
  select(name, age) %>%
  summarize(average_age = mean(age))
3. Expected Output:
  average_age
1       22.75
The four remaining students are aged 22, 23, 24, and 22, so the mean is 91 / 4 = 22.75.
Conclusion:
These assignments help reinforce
the key functions of dplyr, including filter(), select(), arrange(), mutate(), summarize(),
and chaining operations using pipes. They are designed to give you hands-on
experience with data manipulation in R and prepare you for more advanced data
analysis tasks. Make sure to carefully follow each step and understand how each
function contributes to the manipulation of data.
Multiple Choice Questions (MCQs) on Data Manipulation with dplyr
1. Which function in dplyr is used to filter rows based on conditions?
A) select()
B) filter()
C) arrange()
D) mutate()
Answer: B) filter()
Explanation: The filter() function is used to filter rows based on given
conditions (e.g., values of variables).
2. Which function is used to select specific columns in a data frame?
A) select()
B) mutate()
C) arrange()
D) summarize()
Answer: A) select()
Explanation: The select() function is used to choose specific columns
from a data frame.
3. Which function allows you to arrange rows in ascending or descending order?
A) filter()
B) arrange()
C) select()
D) mutate()
Answer: B) arrange()
Explanation: The arrange() function is used to order or sort rows in
ascending or descending order.
4. What does the mutate() function do in dplyr?
A) It adds or transforms columns in a data frame.
B) It filters rows based on conditions.
C) It summarizes the data.
D) It changes the order of rows.
Answer: A) It adds or transforms columns in a data frame.
Explanation: The mutate() function is used to add new columns or modify
existing ones by applying transformations.
5. Which operator is used to chain operations in dplyr?
A) =>
B) %>%
C) ::
D) &
Answer: B) %>%
Explanation: The pipe operator %>% is used to chain multiple
operations in a sequence, making the code more readable.
6. What does the summarize() function do in dplyr?
A) It filters the data based on specific criteria.
B) It aggregates data by calculating summary statistics.
C) It selects specific columns from the data frame.
D) It mutates existing columns.
Answer: B) It aggregates data by calculating summary statistics.
Explanation: The summarize() function is used to aggregate data, often
to calculate measures like mean, sum, etc.
7. Which function would you use to group data by one or more variables?
A) mutate()
B) arrange()
C) group_by()
D) filter()
Answer: C) group_by()
Explanation: The group_by() function is used to group data based on one
or more variables before applying summary functions.
8. In the following code, what is the purpose of the %>% operator?
students %>%
  filter(age > 21) %>%
  select(name, grade)
A) To assign a new column
B) To pass the result of one operation to the next
C) To filter columns
D) To perform data aggregation
Answer: B) To pass the result of one operation to the next
Explanation: The %>% operator is used to pass the output of one
operation into the next function call, allowing for clean, readable code.
9. Which of the following would return the rows for students with grade "A"?
A) filter(students, grade == "A")
B) select(students, grade == "A")
C) arrange(students, grade == "A")
D) mutate(students, grade == "A")
Answer: A) filter(students, grade == "A")
Explanation: The filter() function is used to return rows that meet the
condition grade == "A".
10. Which function would you use to compute the average of a numeric column in dplyr?
A) summarize()
B) arrange()
C) select()
D) mutate()
Answer: A) summarize()
Explanation: The summarize() function is used to compute summary
statistics, such as the mean, for columns.
11. To arrange data in descending order based on a column, which function do you use?
A) filter()
B) select()
C) arrange(desc())
D) mutate()
Answer: C) arrange(desc())
Explanation: The arrange() function can be used with desc() to sort data
in descending order based on a column.
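12. Which of the following tidyverse functions is used to remove rows with missing values?
A) drop_na()
B) select()
C) mutate()
D) filter()
Answer: A) drop_na()
Explanation: The drop_na() function removes rows with missing (NA) values from
the data frame. It belongs to the tidyr package rather than dplyr itself; with
dplyr alone, filter(!is.na(column)) has the same effect for a single column.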
13. What is the purpose of the mutate() function when combined with ifelse()?
A) To filter data
B) To create a new variable based on a condition
C) To summarize the data
D) To group data
Answer: B) To create a new variable based on a condition
Explanation: The mutate() function can be used with ifelse() to create
new variables based on specified conditions.
14. What does the select() function return?
A) All columns in a data frame
B) Specific columns from a data frame
C) A sorted data frame
D) A summarized data frame
Answer: B) Specific columns from a data frame
Explanation: The select() function returns specific columns from a data
frame based on the user's choice.
15. In dplyr, what does the arrange() function do?
A) Groups data by a column
B) Filters rows based on conditions
C) Reorders rows based on specified variables
D) Creates new columns in a data frame
Answer: C) Reorders rows based on specified variables
Explanation: The arrange() function is used to reorder the rows of a
data frame based on one or more columns.
16. Which function is used to rename columns in dplyr?
A) rename()
B) mutate()
C) select()
D) group_by()
Answer: A) rename()
Explanation: The rename() function is used to rename columns in a data
frame.
17. How do you create a new column in a data frame by performing an operation on existing columns?
A) select()
B) arrange()
C) mutate()
D) summarize()
Answer: C) mutate()
Explanation: The mutate() function is used to create new columns or
modify existing columns.
18. Which base R call displays only the first 5 rows of a data frame?
A) head(data, 5)
B) tail(data, 5)
C) filter(data, row_number() <= 5)
D) select(data, top_n())
Answer: A) head(data, 5)
Explanation: The head() function returns the first n rows of the data frame
(default is 6). Option C gives the same rows but relies on dplyr rather than
base R.
19. Which function in dplyr would you use to apply a transformation to a specific column?
A) select()
B) mutate()
C) arrange()
D) summarize()
Answer: B) mutate()
Explanation: The mutate() function is used to apply transformations to
specific columns.
20. What is the correct way to group data by multiple columns using dplyr?
A) group_by(column1, column2)
B) group_by(column1 + column2)
C) group_by(column1 & column2)
D) group_by(column1 || column2)
Answer: A) group_by(column1, column2)
Explanation: You can group data by multiple columns by separating the
column names with commas in the group_by() function.
21. Which function do you use to remove duplicates from a data frame?
A) remove_duplicates()
B) distinct()
C) select()
D) mutate()
Answer: B) distinct()
Explanation: The distinct() function is used to remove duplicate rows
based on selected columns.
22. Which function would you use to create a new column by transforming existing data in a column?
A) summarize()
B) mutate()
C) select()
D) filter()
Answer: B) mutate()
Explanation: The mutate() function is used to apply transformations to
existing data and create new columns.
23. What is the purpose of the summarize() function in dplyr?
A) To reorder the rows
B) To create new columns
C) To group data
D) To calculate summary statistics (e.g., mean, sum)
Answer: D) To calculate summary statistics (e.g., mean, sum)
Explanation: The summarize() function is used to aggregate and summarize
data, calculating statistics like mean, sum, etc.
24. Which function would you use to remove rows with missing values?
A) drop_na()
B) mutate_na()
C) keep_na()
D) filter_na()
Answer: A) drop_na()
Explanation: The drop_na() function (from the tidyr package) removes rows with
missing values. Conversely, filter() with is.na() can be used to keep only the
rows with missing values.
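As a small illustration of removing versus keeping rows with missing values, here is a sketch using a hypothetical scores data frame (not part of the original notes). It assumes the tidyr package is installed, since drop_na() lives there:

```r
library(dplyr)
library(tidyr)  # provides drop_na()

# Hypothetical data with one missing score
scores <- data.frame(
  name  = c("Alice", "Bob", "Charlie"),
  score = c(90, NA, 75)
)

# Remove rows with any missing value (tidyr)
dropped <- drop_na(scores)

# Same effect for a single column using dplyr only
complete <- filter(scores, !is.na(score))

# Keep only the rows where score is missing
missing <- filter(scores, is.na(score))
```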
25. Which of the following is NOT a valid use of arrange() in dplyr?
A) arrange(data, variable)
B) arrange(data, variable, desc(variable))
C) arrange(data, variable, variable2)
D) arrange(variable, data)
Answer: D) arrange(variable, data)
Explanation: arrange() expects the data frame as its first argument, followed
by one or more sorting variables, each optionally wrapped in desc() for
descending order. Options A, B, and C all follow that pattern.
26. What does mutate() in dplyr allow you to do?
A) Group the data
B) Filter rows
C) Modify columns or add new columns
D) Aggregate data
Answer: C) Modify columns or add new columns
Explanation: The mutate() function is used to modify or create new
columns in the data frame.
27. Which function is used to group data by one or more columns in dplyr?
A) mutate()
B) arrange()
C) group_by()
D) select()
Answer: C) group_by()
Explanation: The group_by() function is used to group data by one or
more columns.
28. To get the count of each unique value in a column, which function would you use?
A) count()
B) summarize()
C) group_by()
D) arrange()
Answer: A) count()
Explanation: The count() function is used to get the frequency count of
each unique value in a column.
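29. Which function computes the standard deviation of a column when called inside summarize()?
A) mean()
B) sd()
C) summarize()
D) mutate()
Answer: B) sd()
Explanation: sd() comes from base R's stats package and calculates the standard
deviation of a column. With dplyr it is typically called inside summarize(),
e.g. summarize(data, sd_age = sd(age)).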
30. Which function is used to rename columns in dplyr?
A) select()
B) rename()
C) mutate()
D) group_by()
Answer: B) rename()
Explanation: The rename() function renames columns in a data frame using the
form rename(data, new_name = old_name): the new name goes on the left of the
equals sign and the existing name on the right.
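A short sketch of the argument order in rename(), where student_name is a hypothetical new column name chosen for this example:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# new_name = old_name: the name column becomes student_name
renamed <- rename(students, student_name = name)
```

The other columns and the row order are left unchanged.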
31. What is the purpose of distinct() in dplyr?
A) To remove rows with missing values
B) To filter rows based on conditions
C) To return unique rows based on specific columns
D) To create new columns
Answer: C) To return unique rows based on specific columns
Explanation: The distinct() function is used to return unique rows,
removing duplicates based on specified columns.
32. Which dplyr function would you use to compute the sum of values in a column?
A) sum()
B) aggregate()
C) summarize()
D) mutate()
Answer: C) summarize()
Explanation: The summarize() function computes summary statistics. The base R
function sum() does the arithmetic and is called inside it, e.g.
summarize(data, total = sum(column)).
33. Which function in dplyr allows you to change the order of rows?
A) arrange()
B) mutate()
C) filter()
D) group_by()
Answer: A) arrange()
Explanation: The arrange() function is used to change the order of rows
based on a variable or variables.
34. What would filter(data, age > 25) do in dplyr?
A) Selects rows where age is less than 25
B) Selects rows where age is greater than 25
C) Reorders rows based on age
D) Adds a new column based on age
Answer: B) Selects rows where age is greater than 25
Explanation: The filter() function is used to filter rows where the
condition is met, in this case, age greater than 25.
35. What does group_by() do in dplyr?
A) It calculates summary statistics
B) It creates new columns
C) It groups data by specified variables
D) It filters the rows based on conditions
Answer: C) It groups data by specified variables
Explanation: The group_by() function is used to group data by one or
more columns before performing operations like summarizing or mutating.
36. What is the purpose of summarize() in dplyr?
A) It modifies or creates columns
B) It applies filters to data
C) It aggregates data and calculates summary statistics
D) It groups data based on conditions
Answer: C) It aggregates data and calculates summary statistics
Explanation: The summarize() function is used to perform aggregation
like mean, sum, etc., on grouped data.
37. How do you sort a column in descending order using arrange()?
A) arrange(data, column)
B) arrange(data, desc(column))
C) arrange(data, column, descending)
D) arrange(desc(data, column))
Answer: B) arrange(data, desc(column))
Explanation: To sort in descending order, use arrange() along with desc()
to specify the column in descending order.
38. Which function in dplyr is used to create a new variable based on a condition?
A) mutate()
B) select()
C) arrange()
D) summarize()
Answer: A) mutate()
Explanation: The mutate() function is used to create new variables or
modify existing ones based on conditions or operations.
39. How do you remove rows with missing values in the tidyverse?
A) drop_na()
B) remove_na()
C) clean_na()
D) filter_na()
Answer: A) drop_na()
Explanation: The drop_na() function (from the tidyr package) is used to remove
rows with missing values (NA).
40. Which function is used to display the first few rows of a data frame?
A) tail()
B) head()
C) select()
D) summarize()
Answer: B) head()
Explanation: The head() function displays the first few rows (default is
6) of a data frame.
41. What does mutate() do in dplyr?
A) It filters rows based on conditions
B) It creates or modifies columns
C) It arranges rows in a specified order
D) It calculates summary statistics
Answer: B) It creates or modifies columns
Explanation: The mutate() function is used to add new columns or modify
existing ones by applying transformations.
42. What is the usual way to select columns by their unquoted names inside a dplyr pipeline?
A) select(column1, column2)
B) select(c(column1, column2))
C) select(columns=c(column1, column2))
D) select(c("column1", "column2"))
Answer: A) select(column1, column2)
Explanation: Inside a pipeline (data %>% select(column1, column2)), columns are
listed as unquoted names separated by commas. Wrapping the names in c() also
works, but the comma-separated form is the common idiom.
43. How do you find the number of distinct values in a column using dplyr?
A) distinct(data, column)
B) count(distinct(data, column))
C) count(data, column)
D) summarize(data, count = n())
Answer: B) count(distinct(data, column))
Explanation: distinct(data, column) returns one row per unique value, and
count() on that result gives the number of rows, which is the number of
distinct values. The n_distinct() function inside summarize() is a more direct
alternative.
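The difference between counting distinct values and counting rows per value is easy to mix up. The sketch below (result names are illustrative) shows both on the students data:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Number of distinct grades: distinct() then count()
via_distinct <- students %>% distinct(grade) %>% count()

# Same number using n_distinct() inside summarize()
via_n_distinct <- summarize(students, n_grades = n_distinct(grade))

# Rows per grade: one output row for each unique value
per_grade <- count(students, grade)
```

Here there are 3 distinct grades, and per_grade reports 2 students for A, 2 for B, and 1 for C.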
44. What is the role of the filter() function in dplyr?
A) To arrange the data by column
B) To create new columns
C) To filter rows based on specified conditions
D) To group data by variables
Answer: C) To filter rows based on specified conditions
Explanation: The filter() function is used to return rows that meet
specific conditions or criteria.
45. Which function would you use to join two data frames in dplyr?
A) bind_rows()
B) merge()
C) join()
D) left_join()
Answer: D) left_join()
Explanation: The left_join() function is used to join two data frames by
matching rows based on a common column, keeping all rows from the left data
frame.
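To make the "keeps all rows from the left data frame" behavior concrete, here is a sketch with a hypothetical clubs data frame (the table, its club column, and the student Frank are invented for illustration):

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Hypothetical lookup table; Frank is not in students
clubs <- data.frame(
  name = c("Alice", "Bob", "Frank"),
  club = c("Chess", "Robotics", "Drama")
)

# Every row of students is kept; unmatched rows get NA in club,
# and Frank is dropped because he has no match on the left side
joined <- left_join(students, clubs, by = "name")
```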
46. What does the summarize() function require as an argument?
A) A column name to be selected
B) An operation to be applied to columns (like mean or sum)
C) A condition to filter rows
D) A grouping variable
Answer: B) An operation to be applied to columns (like mean or sum)
Explanation: The summarize() function requires a summary operation, such
as mean(), sum(), etc., applied to columns.
47. Which function is used to add rows from one data frame to another?
A) bind_rows()
B) merge()
C) combine()
D) union()
Answer: A) bind_rows()
Explanation: The bind_rows() function is used to add rows from one data
frame to another, stacking them vertically.
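A minimal sketch of stacking rows with bind_rows(), using a hypothetical new_students data frame with the same columns as students:

```r
library(dplyr)

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  age   = c(22, 23, 21, 24, 22),
  grade = c("A", "B", "A", "C", "B")
)

# Hypothetical extra row with matching column names
new_students <- data.frame(name = "Frank", age = 20, grade = "B")

# Rows of new_students are appended below the rows of students
all_students <- bind_rows(students, new_students)
```

bind_rows() matches columns by name, so the column order in the two data frames does not need to agree.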
48. How can you create a new column in a data frame that is the square of an existing column?
A) mutate(new_col = existing_col^2)
B) mutate(existing_col = existing_col^2)
C) mutate(new_col = existing_col * 2)
D) mutate(new_col = square(existing_col))
Answer: A) mutate(new_col = existing_col^2)
Explanation: The mutate() function is used to create a new column by
applying a transformation to an existing column.
49. What is the function arrange() used for in dplyr?
A) To filter rows based on conditions
B) To add new columns to a data frame
C) To reorder rows based on columns
D) To group data by columns
Answer: C) To reorder rows based on columns
Explanation: The arrange() function is used to reorder rows based on the
values of one or more columns.
50. Which dplyr function would you use to get the summary statistics for a group of variables?
A) group_by()
B) summarize()
C) select()
D) filter()
Answer: B) summarize()
Explanation: The summarize() function is used to calculate summary
statistics, such as mean, sum, and count, for grouped data.
