R Language Descriptive Statistics: Step-by-Step Implementation and Top 10 Questions and Answers
Last Update: April 01, 2025      19 mins read      Difficulty-Level: beginner

Descriptive Statistics in R Language: A Comprehensive Guide

Descriptive statistics play a vital role in data analysis by providing a summary of the main characteristics of a dataset. These statistics help us understand the basic features of the data, such as central tendency (mean, median), dispersion (range, variance, standard deviation), shape (skewness, kurtosis), and position (quartiles, percentiles). The R programming language offers numerous built-in functions and packages for computing descriptive statistics, which makes R a practical tool for data analysts, statisticians, and researchers.

Basic Descriptive Statistics Functions in R

  1. Mean: The mean() function calculates the average value of a numeric vector.

    data <- c(4, 2, 9, 5, 8)
    mean(data)
    #[1] 5.6
    
  2. Median: The median() function finds the middle value of a numeric vector when arranged in ascending order.

    median(data)
    #[1] 5
    
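  3. Mode: Base R has no built-in function for the statistical mode (the existing mode() function returns an object's storage type instead). A short custom function fills the gap. When several values tie for the highest count, as in data where every value occurs once, it returns the first one encountered: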

    get_mode <- function(v) {
        uniq_vals <- unique(v)
        uniq_counts <- tabulate(match(v, uniq_vals))
        mode_val <- uniq_vals[which.max(uniq_counts)]
        return(mode_val)
    }
    
    get_mode(data)
    #[1] 4
    
  4. Range: To find the minimum and maximum values of a numeric vector, use the range() function.

    range(data)
    #[1] 2 9
    
  5. Variance: The var() function computes the sample variance of a numeric vector, dividing by n - 1 rather than n.

    var(data)
    #[1] 8.3
    
  6. Standard Deviation: The sd() function computes the standard deviation, which is the square root of variance.

    sd(data)
    #[1] 2.880972
    
  7. Quantiles: The quantile() function can compute the quartiles, percentiles, and other quantiles of a numeric vector.

    quantile(data, c(0, 0.25, 0.5, 0.75, 1)) 
    #  0%  25%  50%  75% 100%
    #   2    4    5    8    9
    
  8. Summary Statistics: The summary() function gives a compact overview: minimum, first quartile, median, mean, third quartile, and maximum.

    summary(data)
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #    2.0     4.0     5.0     5.6     8.0     9.0
    
  9. Minimum and Maximum Values: The min() and max() functions provide individual minimum and maximum values respectively.

    min(data)
    #[1] 2
    max(data)
    #[1] 9
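
Items 5 and 6 are worth checking by hand, because var() and sd() use the sample formulas (dividing by n - 1), which differ from the population formulas (dividing by n). A minimal sketch using the same data vector; the variable names are illustrative:

```r
data <- c(4, 2, 9, 5, 8)
n <- length(data)
sq_dev <- (data - mean(data))^2       # squared deviations from the mean

sample_var <- sum(sq_dev) / (n - 1)   # the formula var() uses
pop_var    <- sum(sq_dev) / n         # population variance; var() does not return this

sample_var                            # 8.3
pop_var                               # 6.64
all.equal(sample_var, var(data))      # TRUE
all.equal(sqrt(sample_var), sd(data)) # TRUE
```

The population variance is smaller by a factor of (n - 1)/n, which matters most for small samples like this one.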
    

Advanced Descriptive Statistics in R

While the above functions are useful, they only scratch the surface of what descriptive statistics can offer. More advanced analyses involve measures like skewness, kurtosis, covariance, and correlation coefficients; skewness and kurtosis require an additional package, while correlation and covariance are available in base R.

  1. Skewness: Skewness measures the asymmetry of a distribution. Positive skewness indicates a longer tail (and often outliers) at the right end of the distribution, while negative skewness indicates a longer tail at the left end. It can be calculated using the skewness() function from the moments package.

    install.packages("moments")
    library(moments)
    
    skewness(data)
    #[1] 0.02524827
    

    If skewness is close to 0, the data is approximately symmetric.

  2. Kurtosis: Kurtosis measures how heavy the tails of a distribution are compared to a normal distribution. High kurtosis means heavy tails (more extreme values), whereas low kurtosis indicates lighter tails. Use the kurtosis() function from the moments package for calculation.

    kurtosis(data)
    #[1] 1.548919
    

    The moments package reports Pearson's kurtosis, for which a normal distribution has a value of 3; values higher than 3 suggest fatter tails, and values below 3 suggest lighter tails.

  3. Correlation Coefficient: Correlation measures the relationship between two variables. Pearson’s correlation coefficient is commonly used and can be calculated using the cor() function.

    x <- c(1, 2, 3, 4, 5)
    y <- c(2, 4, 5, 4, 5)
    
    cor(x, y, method = "pearson")
    #[1] 0.7745967
    

    Values close to 1 or -1 indicate strong positive or negative correlation, respectively.

  4. Covariance: Covariance measures how much two random variables vary together. It can be computed using the cov() function.

    cov(x, y)
    #[1] 1.5
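
Covariance and correlation are closely related: Pearson's correlation is the covariance rescaled by both standard deviations. A short sketch with the x and y vectors from item 3; cov_xy and r_manual are illustrative names:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov_xy   <- cov(x, y)                  # sample covariance (divides by n - 1)
r_manual <- cov_xy / (sd(x) * sd(y))   # rescale to the range -1 to 1

cov_xy                                 # 1.5
r_manual
all.equal(r_manual, cor(x, y))         # TRUE
```

Because the rescaling removes the units, correlation is easier to compare across variable pairs than covariance.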
    

Summary Statistics with Data Frames

When dealing with large datasets organized in data frames, it's often convenient to obtain summary statistics for multiple variables simultaneously. The summary() function works well with data frames.

df <- data.frame(
  Var1 = c(10, 20, 30, 40, 50),
  Var2 = c(1, 2, 1, 2, 2),
  Var3 = factor(c("A", "B", "A", "C", "B"))
)

summary(df)
#       Var1         Var2      Var3
#  Min.   :10   Min.   :1.0   A:2
#  1st Qu.:20   1st Qu.:1.0   B:2
#  Median :30   Median :2.0   C:1
#  Mean   :30   Mean   :1.6
#  3rd Qu.:40   3rd Qu.:2.0
#  Max.   :50   Max.   :2.0

For more detailed statistics, the psych package offers the describe() function.

install.packages("psych")
library(psych)

describe(df)
# Values rounded to two decimals. Var3 is a factor, so describe() converts it
# to integer codes and marks it with an asterisk.
#       vars n mean    sd median trimmed   mad min max range  skew kurtosis   se
# Var1     1 5 30.0 15.81     30    30.0 14.83  10  50    40  0.00    -1.91 7.07
# Var2     2 5  1.6  0.55      2     1.6  0.00   1   2     1 -0.29    -2.25 0.24
# Var3*    3 5  1.8  0.84      2     1.8  1.48   1   3     2  0.25    -1.82 0.37

Group-wise Summary Statistics

Often, we need descriptive statistics by subgroups. The dplyr package provides group_by() together with summarise() (also spelled summarize()) to compute these efficiently.

install.packages("dplyr")
library(dplyr)

# Assuming the previous dataframe 'df'
grouped_df <- df %>%
  group_by(Var2) %>%
  summarise(Mean_Var1 = mean(Var1), SD_Var1 = sd(Var1))

grouped_df
# # A tibble: 2 × 3
#    Var2 Mean_Var1 SD_Var1
#   <dbl>     <dbl>   <dbl>
# 1     1      20      14.1
# 2     2      36.7    15.3

Note: sd() returns NA for a group that contains a single observation, because the sample standard deviation needs at least two values.
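
The same group-wise summaries can be produced without extra packages. A minimal sketch using base R's aggregate() and tapply() on the same values; group_means and group_sds are illustrative names:

```r
df <- data.frame(
  Var1 = c(10, 20, 30, 40, 50),
  Var2 = c(1, 2, 1, 2, 2)
)

# Mean of Var1 within each level of Var2, returned as a data frame
group_means <- aggregate(Var1 ~ Var2, data = df, FUN = mean)

# tapply() returns a named array instead of a data frame
group_sds <- tapply(df$Var1, df$Var2, sd)

group_means
group_sds
```

aggregate() is convenient when the result should stay a data frame, while tapply() is handy for a quick look at one statistic.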

Visualization and Descriptive Statistics

Descriptive statistics are often supported by visual methods. The ggplot2 package and base R functions such as boxplot() and hist() offer various ways to graphically represent the data.

install.packages("ggplot2")
library(ggplot2)

# Histogram of Var1
ggplot(df, aes(x = Var1)) + geom_histogram(binwidth = 10, fill = "lightblue")

# Boxplot comparing Var1 across different levels of Var2
ggplot(df, aes(x = as.factor(Var2), y = Var1)) + geom_boxplot(fill = "lightgreen")

Handling Missing Values

Missing values are common in datasets and can affect the accuracy of statistical calculations. The na.rm parameter in many functions allows removing NA values before computation. Additionally, the complete.cases() function helps identify rows with complete observations.

data_with_na <- c(4, 2, NA, 5, 8)
mean(data_with_na, na.rm = TRUE)
#[1] 4.75

complete_cases_index <- complete.cases(data_with_na)
data_clean <- data_with_na[complete_cases_index]
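
It can help to see what happens without na.rm and how the same idea extends to data frames. A short sketch; df_na is an illustrative data frame, not part of the example above:

```r
data_with_na <- c(4, 2, NA, 5, 8)

mean(data_with_na)          # NA: a single missing value propagates
sum(is.na(data_with_na))    # number of missing values: 1

# For a plain vector, is.na() gives the same filter as complete.cases()
data_clean <- data_with_na[!is.na(data_with_na)]

# For a data frame, na.omit() drops every row that contains an NA
df_na <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df_complete <- na.omit(df_na)
nrow(df_complete)           # 1
```

Counting missing values first is a useful habit, since silently dropping them can change the sample size behind every statistic.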

Importance of Descriptive Statistics

  • Insight into Data Structure: Descriptive statistics provide crucial insights about data structure, helping analysts identify patterns, trends, and outliers.
  • Data Cleaning: They highlight potential inconsistencies and missing values, guiding the process of data cleaning.
  • Comparative Analysis: These statistics enable comparisons between different variables and subgroups within the same dataset.
  • Foundation for Inferential Statistics: Descriptive statistics set the foundation for inferential statistics by summarizing the sample from which conclusions about the population are drawn.

In conclusion, descriptive statistics are indispensable tools for analyzing data effectively in R. Whether through built-in functions like mean(), median(), or advanced measures provided by packages like moments and psych, R provides ample capabilities to summarize and visualize data comprehensively. Understanding these methods ensures accurate data interpretation and analysis, leading to better decision-making processes.




Step-by-Step Guide for Beginners: Setting Up, Running, and Understanding a Descriptive Statistics Example in R

Descriptive statistics is a fundamental aspect of data analysis that involves summarizing and understanding the features of a dataset using quantitative measures such as the mean, median, mode, standard deviation, and variance. In R, a powerful and flexible programming language for statistical computing and graphics, descriptive statistics can be easily performed with built-in functions and packages. Below, we guide you through setting up your environment in R, running an example, and understanding the data flow step by step.

Setting Up Your Environment

  1. Install R and RStudio:

    • R: Download and install R from the Comprehensive R Archive Network (CRAN) at cran.r-project.org. Follow the instructions for your operating system.
    • RStudio: Download and install RStudio from the official website posit.co/download/rstudio-desktop/. RStudio provides an integrated development environment (IDE) for using R more efficiently.
  2. Create a New Project:

    • Open RStudio and go to File > New Project.
    • Choose New Directory, select New Project, and specify a name and location for your project.
    • Click on Create.
  3. Install Necessary Packages:

    • Some functions for descriptive statistics are built into R, but it's beneficial to install additional packages for more advanced analyses. Open the R console in RStudio and run the following commands:
      install.packages("dplyr")
      install.packages("ggplot2")
      install.packages("summarytools")
      
    • Load the libraries in your R script:
      library(dplyr)
      library(ggplot2)
      library(summarytools)
      

Running an Application: Descriptive Statistics Example

For demonstration, let's use the built-in mtcars dataset, which includes fuel consumption and ten aspects of automobile design for 32 automobiles.

Step 1: Load the Dataset

Load the mtcars dataset, which ships with R in the datasets package:

data(mtcars)

Step 2: Examine the Dataset

Quickly explore the first few rows and the structure of mtcars to understand its variables:

head(mtcars)
str(mtcars)

Output:

  • head(mtcars) displays the first six rows.
  • str(mtcars) provides a summary of the data structure, including class and dimensions.

Step 3: Basic Summary Statistics

Use the summary() function to get basic statistics like mean, median, quartiles, and range for each numerical variable:

summary(mtcars)

Output:

  • Provides key statistics for each column in the dataset.

Step 4: Additional Descriptive Statistics

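Calculate additional statistics such as skewness and kurtosis with the descr() function from the summarytools package, and use dfSummary() for a column-by-column overview:

descr(mtcars)
dfSummary(mtcars)

Output:

  • descr() reports means, standard deviations, quartiles, skewness, kurtosis, and more for each numeric column.
  • dfSummary() lists each variable's type, basic statistics or frequencies, and valid and missing counts.

Step 5: Visualizations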

Create visual summaries such as histograms and boxplots using ggplot2:

# Histogram for MPG (miles per gallon)
ggplot(mtcars, aes(x=mpg)) +
  geom_histogram(binwidth=2, fill="blue", color="black", alpha=0.7) +
  labs(title="Histogram of Miles Per Gallon", x="Miles Per Gallon", y="Frequency")

# Boxplot for MPG across different cylinders
ggplot(mtcars, aes(x=factor(cyl), y=mpg)) +
  geom_boxplot(fill="lightblue", color="black") +
  labs(title="Boxplot of MPG by Number of Cylinders", x="Number of Cylinders", y="Miles Per Gallon")

Output:

  • Histogram shows the distribution of miles per gallon.
  • Boxplot illustrates the variation of mpg based on the number of cylinders.

Understanding the Data Flow

  • Data Loading: The mtcars dataset is loaded from R's built-in datasets package.
  • Data Exploration: head() and str() functions provide a preliminary understanding of the dataset.
  • Descriptive Statistics: summary() and dfSummary() functions generate essential statistics.
  • Visualization: ggplot2 is used to create graphical representations that help visualize distributions and comparisons.
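
The Descriptive Statistics step can also be extended by hand when a statistic is not part of summary(). A minimal sketch, assuming a hypothetical helper named col_stats() (it is not part of any package) applied to every column of mtcars:

```r
data(mtcars)

# Hypothetical helper: a few statistics that summary() does not report
col_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), iqr = IQR(x), n = length(x))
}

# sapply() applies the helper to every column; t() puts variables in rows
stats_table <- t(sapply(mtcars, col_stats))
round(head(stats_table, 3), 2)
```

This works here because every column of mtcars is numeric; a data frame with factor or character columns would need those filtered out first.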

By following these steps, you can effectively perform descriptive statistics on any dataset using R, gaining valuable insights into the data structure and characteristics. R's rich ecosystem of packages and functions makes it an excellent tool for statistical analysis and visualization, making it a great choice for beginners and advanced users alike.

This guide should provide a solid foundation for performing descriptive statistics in R, enabling you to apply these concepts to your own datasets confidently.




Top 10 Questions and Answers on R Language Descriptive Statistics

1. What are descriptive statistics in R, and why are they important?

Descriptive statistics in R are used to summarize and describe the main features of a dataset in a concise way. This includes measures like mean, median, mode, standard deviation, variance, minimum, maximum, range, quantiles, and summary statistics that help us understand the central tendency, dispersion, shape, and other essential characteristics of our data.

Why are they important? Descriptive statistics provide insights that allow us to make informed decisions without diving into complex statistical models. They help us understand the data better before applying any further analysis or modeling techniques. In practical terms, this could be anything from evaluating the effectiveness of advertising strategies in marketing to determining the baseline measurements in a clinical trial.

# Example: Compute basic descriptive statistics using the `summary` function
data <- c(12, 15, 16, 23, 34, 45, 56, 67, 89, 100)
summary(data)

2. How do you calculate the mean and median of a dataset in R?

Mean is the average of all values in a dataset, calculated by summing all the values and dividing by the number of observations.
Median is the middle value when the dataset is ordered; if there is an even number of observations, it's the average of the two middle numbers.

In R, you can use the mean() and median() functions to calculate these.

# Calculating mean
mean_vector <- mean(data)

# Calculating median
median_vector <- median(data)

print(mean_vector)
print(median_vector)
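
The practical difference between the two measures shows up when an extreme value is present. A small sketch; the appended value 1000 is an illustrative outlier, not part of the original data:

```r
data <- c(12, 15, 16, 23, 34, 45, 56, 67, 89, 100)
data_outlier <- c(data, 1000)     # append one extreme value

mean(data)                        # 45.7
median(data)                      # 39.5

mean(data_outlier)                # pulled far upward by the outlier
median(data_outlier)              # 45: barely moves
mean(data_outlier, trim = 0.1)    # trimmed mean: drops 10% of values from each end
```

The trim argument of mean() offers a middle ground between the mean and the median when a few extreme values are expected.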

3. What is the mode, and how do we find it in R when there is no built-in function for it?

The mode is the most frequently occurring value in a dataset. Unlike mean and median, base R has no built-in function for the statistical mode (the existing mode() function returns an object's storage type, such as "numeric"). We can write a custom function to achieve this.

Here is an example of how to define a function for the mode in R:

# Custom function to calculate mode
get_mode <- function(v) {
   uniq_values <- unique(v)
   uniq_counts <- tabulate(match(v, uniq_values))
   uniq_values[which.max(uniq_counts)]
}

# Example usage
mode_of_data <- get_mode(data)
print(mode_of_data)
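
The function above returns a single value even when several values are tied for the highest count. A minimal sketch of a variant that returns every tied mode, built on table(); the name get_all_modes and the sample vector are illustrative assumptions:

```r
# Hypothetical helper: return every value that shares the highest frequency
get_all_modes <- function(v) {
  counts <- table(v)                              # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]     # labels with the top count
  # table() stores values as character labels, so convert back for numeric input
  if (is.numeric(v)) as.numeric(top) else top
}

scores <- c(3, 7, 7, 2, 3, 9)   # illustrative data: 3 and 7 each occur twice
get_all_modes(scores)           # 3 7
```

If every value occurs exactly once, this variant returns the whole set of distinct values, which is a hint that the mode is not informative for that data.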

4. How do you compute variance and standard deviation in R?

Variance captures how much the individual data points vary from the mean or how much the dataset is spread out. Standard deviation is the square root of the variance and provides a more interpretable measure of dispersion as it is in the same units as the original data.

In R, you can use the var() and sd() functions to compute these:

# Calculate variance
variance_data <- var(data)

# Calculate standard deviation
sd_data <- sd(data)

print(variance_data)
print(sd_data)

5. What is a quantile, and how do you compute it in R?

A quantile is a cut point that splits the ordered data so that a given proportion of values falls below it; a set of quantiles divides the data into equal-sized parts. The quartiles (Q1, Q2, and Q3), which split the data into quarters, are the best-known type of quantile. In R, the quantile() function is used to compute quantiles.

Here is an example of how to compute the first quintile (20th percentile):

# Calculate the first quintile (20th percentile)
first_quintile <- quantile(data, 0.2)
print(first_quintile)

This function can also be used to compute multiple quantiles at once:

# Calculate multiple quantiles (e.g., quartiles)
quartiles <- quantile(data, probs = c(0.25, 0.50, 0.75))
print(quartiles)
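
quantile() supports nine interpolation methods through its type argument (the default is type 7), which is one reason other software can report slightly different quartiles for the same data. A small sketch comparing two of them on the same data vector; q7 and q6 are illustrative names:

```r
data <- c(12, 15, 16, 23, 34, 45, 56, 67, 89, 100)

q7 <- quantile(data, 0.25)             # default method (type = 7)
q6 <- quantile(data, 0.25, type = 6)   # an alternative interpolation rule

q7
q6
```

The two results differ because each method places the 25% cut point at a slightly different position between the sorted values; with larger samples the gap usually shrinks.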

6. How do you create a boxplot in R to visualize descriptive statistics?

A boxplot gives a good visual representation of the distribution of data through its quartiles and potential outliers.

Here’s how to create a boxplot in R:

# Create a boxplot
boxplot(data,
        main="Boxplot",
        xlab="Sample Data")

# Adding more datasets for comparison
more_data <- matrix(rnorm(100*5, mean=rep(1:5, each=100), sd=1), ncol=5)
boxplot(more_data,
        main="Comparison Boxplot",
        xlab="Different Samples",
        ylab="Values")

7. How do you create histograms and density plots in R?

Histograms help visualize the frequency distribution of a variable. Density plots provide a smoothed version of histograms.

Here’s an example of creating histograms:

# Histogram plot
hist(more_data[,1],
     main="Histogram",
     xlab="Value",
     col="lightblue",
     border="black")

And here’s an example of creating a density plot:

# Density plot
plot(density(more_data[,1]),
     main="Density Plot",
     xlab="Value",
     ylab="Density",
     col="blue",
     lwd=2)

8. What are some useful summary statistics functions in R besides summary?

Besides the summary() function, there are several other functions in R that provide detailed summary statistics:

  • mean(): Computes the arithmetic mean.
  • median(): Computes the median.
  • var(): Computes the variance.
  • sd(): Computes the standard deviation.
  • quantile(): Computes sample quantiles.
  • min() and max(): Return the minimum and maximum values.
  • range(): Provides a vector containing the minimum and maximum values of a numeric vector.
  • IQR(): Computes the inter-quartile range of a numeric vector.
  • fivenum(): Returns Tukey’s five-number summary (minimum, lower-hinge, median, upper-hinge, maximum).
  • length(): Gives the length of a vector.
  • length(unique(x)): Number of unique elements.
  • table(): Creates a contingency table.
  • Desc() function from the DescTools package: Provides a comprehensive summary of a variable or dataset, including skewness, kurtosis, etc.

library(DescTools)
Desc(data)
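
A few of the base functions in this list are easiest to understand side by side. A brief sketch on the same data vector; the grades vector is illustrative:

```r
data <- c(12, 15, 16, 23, 34, 45, 56, 67, 89, 100)

IQR(data)              # Q3 - Q1, using quantile()'s default method
fivenum(data)          # min, lower hinge, median, upper hinge, max
length(unique(data))   # number of distinct values

grades <- c("A", "B", "A", "C", "B", "A")
table(grades)          # counts per category
```

Note that the hinges from fivenum() are not always equal to the quartiles from quantile(), because the two functions use different rules for locating them.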

9. How do you find the skewness and kurtosis of a dataset in R?

Skewness describes the asymmetry of a distribution around its mean. Positive skew indicates a longer tail on the right side of the distribution; negative skew indicates a longer tail on the left side.

Kurtosis characterizes how heavy a distribution's tails are relative to its overall spread, and so describes how prone the data are to extreme values in either tail. Under the convention used by the moments package, a normal distribution has a kurtosis of 3.

To calculate skewness and kurtosis in R, you can use the skewness() and kurtosis() functions from the moments package.

# Install moments package if not already installed
install.packages("moments")

# Load moments package
library(moments)

# Calculate skewness
skewness_of_data <- skewness(data)

# Calculate kurtosis
kurtosis_of_data <- kurtosis(data)

print(skewness_of_data)
print(kurtosis_of_data)
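
Both measures can also be computed in base R from central moments, which makes the definitions concrete. A minimal sketch, assuming the moment-based definitions (dividing by n) that the moments package documents; vals and the other names are illustrative:

```r
vals <- c(4, 2, 9, 5, 8)      # small illustrative vector
dev  <- vals - mean(vals)     # deviations from the mean

m2 <- mean(dev^2)             # second central moment (divides by n)
m3 <- mean(dev^3)             # third central moment
m4 <- mean(dev^4)             # fourth central moment

skew_manual <- m3 / m2^1.5    # moment-based skewness
kurt_manual <- m4 / m2^2      # Pearson kurtosis; a normal distribution gives 3

skew_manual
kurt_manual
kurt_manual - 3               # excess kurtosis, which some other packages report
```

Checking which convention a package uses (Pearson or excess kurtosis) avoids misreading a value near 0 or near 3.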

10. How do you perform a correlation analysis on a dataset in R?

Correlation measures the linear relationship between two variables and ranges from -1 to +1. A perfect positive correlation (+1) means that both variables move in the same direction proportionally; a perfect negative correlation (-1) means that as one variable increases, the other decreases proportionally. Zero correlation means no linear relationship.

To compute a correlation coefficient and visualize it, you can use:

# Generating sample data
set.seed(123)
x <- rnorm(100)
y <- rnorm(100) + 0.5*x

# Compute correlation coefficient
correlation_coefficient <- cor(x, y)
print(correlation_coefficient)

# Scatterplot with correlation coefficient
plot(x, y,
     main = paste("Scatterplot with Correlation Coefficient", round(correlation_coefficient, 2)),
     xlab = "Variable X",
     ylab = "Variable Y")

# Add the fitted linear regression line as a visual guide to the relationship
abline(lm(y ~ x), col="red", lwd=2)

Using cor.test for hypothesis testing and confidence intervals:

# Perform correlation test
cor_test_results <- cor.test(x, y)
print(cor_test_results)

These functions (cor() and cor.test()) compute Pearson's product-moment correlation coefficient by default but also support other types like Spearman and Kendall correlations.

# Compute Spearman correlation coefficient
spearman_correlation <- cor(x, y, method = "spearman")
print(spearman_correlation)

# Compute Kendall correlation coefficient
kendall_correlation <- cor(x, y, method = "kendall")
print(kendall_correlation)
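
For more than two variables, cor() also accepts a data frame or matrix and returns a correlation matrix. A brief sketch on three columns of the built-in mtcars dataset; the column choice is only illustrative:

```r
data(mtcars)

# Pearson correlation matrix for three numeric columns
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])
round(cor_matrix, 2)

# use = "complete.obs" drops rows with missing values before computing
cor(mtcars$mpg, mtcars$wt, use = "complete.obs")
```

The diagonal of the matrix is always 1, and the matrix is symmetric, so only the entries above or below the diagonal need to be read.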

Descriptive statistics using R provide a powerful toolset for analyzing and understanding data sets. They are essential building blocks for advanced data analyses and machine learning tasks. By summarizing data efficiently, you can uncover underlying patterns, identify anomalies, and make informed business decisions based on your data.