Chapter 5: Data Science with R: Getting Started

The surge in data shows no sign of slowing. IDC has forecast that the global volume of data will reach 175 zettabytes by 2025. Handling this volume is a challenge for businesses in every industry. As a result, companies worldwide are looking for people who can interpret data and turn it into relevant, actionable insights.

This is where data science comes in. In this guide, you will learn the basics of data science with R.

Introduction to R

R is a free, open-source programming language widely used for statistical computing and data analysis. It is typically used through a command-line interface, and it runs on popular operating systems, including Windows, Linux, and macOS.

It was created in New Zealand by Ross Ihaka and Robert Gentleman and is now developed by the R Core Team. R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. It was conceived in 1992, with an early version released in 1995 and the first stable release, version 1.0.0, issued in 2000.

Features of R

R provides a wide range of statistical and graphical tools. It has an extensive package library that simplifies developing machine learning algorithms. In addition, it integrates with popular tools such as Tableau and Microsoft SQL Server.

R is more than simply a programming language; it also features a global repository system called CRAN (Comprehensive R Archive Network). It is available at https://cran.r-project.org/.

CRAN hosts the R sources, R binaries, R packages, documentation, and all major updates. It offers more than 10,000 R packages.

Statistical Features:

  1. Basic Statistics: The mean, median, and mode are the most commonly used basic statistics. All three are measures of central tendency, and R can compute each of them.
  2. Static graphics: R provides many tools for producing high-quality static graphics. It supports many plot types, including maps, mosaic plots, biplots, and more.
  3. Probability distributions: Probability distributions are important in statistics, and we can easily handle several types of probability distributions using R, such as the Binomial Distribution, Normal Distribution, Chi-squared Distribution, and many more.
  4. Data analysis: It provides comprehensive, consistent, and integrated data analysis capabilities.
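As a rough illustration of the statistical features listed above, the sketch below computes the three measures of central tendency and evaluates a few probability distributions. The `scores` vector and the `stat_mode` helper are hypothetical names introduced for this example. Base R has no built-in function for the statistical mode, so the helper is one possible way to compute it.

```r
# Hypothetical sample; any numeric vector would do.
scores <- c(2, 4, 4, 5, 7, 9, 4)

mean(scores)     # arithmetic mean
median(scores)   # middle value of the sorted data

# Base R's mode() reports an object's storage mode, not the statistical mode,
# so a small helper (hypothetical name) is used here instead.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(scores)

# Probability distributions follow a d/p/q/r naming scheme:
# density, cumulative probability, quantile, random draws.
dbinom(2, size = 4, prob = 0.5)   # P(X = 2) for X ~ Binomial(4, 0.5)
pnorm(1.96)                       # P(Z <= 1.96) for a standard normal Z
qchisq(0.95, df = 2)              # 95th percentile of Chi-squared with 2 df
```

The same prefix scheme applies to other distributions, for example `rnorm` for random draws from a normal distribution.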

Programming Features:

  1. R Packages: One of R’s most notable aspects is its abundance of libraries. CRAN (Comprehensive R Archive Network) is a repository for R that contains over 10,000 packages.
  2. Distributed computing: Distributed computing is a technique in which the components of a software system are spread across multiple computers to improve efficiency and performance. In November 2015, two R packages for distributed programming, ddR and multidplyr, were released.

Applications of R:

  1. R is very useful for data science. It gives data scientists a wide range of statistics libraries and serves as a platform for statistical computing and design.
  2. Many quantitative analysts use R as their programming language, in part because it helps with importing and cleaning data.
  3. Alongside Python, R is one of the most common languages among data analysts and research programmers. It is also widely used as a core tool in finance.
  4. R is used at large organizations such as Google, Facebook, Bing, Twitter, Accenture, and Wipro, among many others.

R and Python are both essential tools in data science, and newcomers often find it difficult to decide which of the two suits them better.

Installation of R

R is available for free on the CRAN website. Select your operating system to download the matching installer, then keep the default settings to complete the installation.

You may also install RStudio, an integrated development environment for R. It comes in two editions: RStudio Desktop is a standard desktop program, while RStudio Server runs on a remote server and provides access to RStudio through a web browser.

Before you begin programming in R, install the packages you need along with their dependencies. Packages are bundled collections of functions and objects, and they are hosted on CRAN. Not every package is installed or loaded by default, so you can add one whenever you need it.

To add a new package to RStudio, navigate to Tools -> Install Packages.

Then you may search for the package you want to install and choose where to install it.
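Packages can also be installed from the R console instead of the menu. The sketch below wraps `install.packages` in a small helper that only downloads a package when it is missing. The helper name `use_package` is hypothetical, and the example assumes a CRAN mirror is configured if a download is actually needed.

```r
# Hypothetical helper: install a package from CRAN only if it is missing,
# then attach it to the current session.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call never needs to download anything.
use_package("stats")
```

Calling `install.packages("name")` directly has the same effect as the menu entry; the check simply avoids reinstalling a package that is already present.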

There are various data structures accessible in the R programming language:

  1. Vectors: The most basic R object. An atomic vector holds values of a single type.
  2. Matrices: R objects whose elements are arranged in a two-dimensional grid. All elements must be of the same type.
  3. Arrays: Data structures that can hold data in more than two dimensions. For example, an array with dimensions (2, 3, 4) consists of four rectangular matrices, each with two rows and three columns.
  4. Data Frames: A table in which each column holds the values of one variable and each row holds one set of values, one from each column.
  5. Lists: A list can contain elements of different types (numbers, strings, vectors, etc.). Its elements can also be matrices or functions. The list() function creates a list.
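The five structures above could be created as follows. This is a minimal sketch; all variable names and values are made up for illustration.

```r
v <- c(1.5, 2.5, 3.5)                  # numeric vector: one type only
m <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3 matrix of integers
a <- array(1:24, dim = c(2, 3, 4))     # four 2 x 3 matrices stacked together
df <- data.frame(name = c("Ann", "Bob"), age = c(31, 27))   # one variable per column

# A list may mix types, including a matrix and a function.
l <- list(count = 3L, label = "demo", values = v, grid = m, fn = mean)

dim(a)           # dimensions of the array
str(df)          # structure of the data frame
l$fn(l$values)   # call the function stored in the list on the stored vector
```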

Importing and Exporting files in R

Importing files in R 

Using R, you can import data from a variety of sources:

  1. Table: We can use the read.table function in R to load a table.
  2. CSV: Use the read.csv function to import a .csv file.

Exporting Files in R

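In R, you can also export data to files in a location of your choice. R function names are case-sensitive, so these functions are written in lowercase.

  1. Use write.table to export a table.
  2. Base R has no function for writing Excel files; an add-on package is needed, for example write.xlsx from the openxlsx package.
  3. To export a CSV file, use the following syntax: write.csv(df, "c:/file name.csv"), where df is the data frame to export.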
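A quick way to try importing and exporting together is a round trip through temporary files. The sketch below uses the built-in `cars` data frame and `tempfile()` paths so that it does not depend on any particular folder. R is case-sensitive, so the functions are spelled `write.csv` and `write.table` in lowercase.

```r
# Write a built-in data frame to a temporary CSV file, then read it back.
csv_path <- tempfile(fileext = ".csv")
write.csv(cars, csv_path, row.names = FALSE)
cars_csv <- read.csv(csv_path)

# Same round trip with write.table / read.table, using tab separators.
tsv_path <- tempfile(fileext = ".txt")
write.table(cars, tsv_path, sep = "\t", row.names = FALSE)
cars_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")

dim(cars_csv)     # same number of rows and columns as the original
names(cars_tsv)   # column names survive the round trip
```

Setting `row.names = FALSE` keeps R from writing an extra column of row numbers, which would otherwise reappear as a data column on import.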

Data Visualization in R

R includes powerful graphics tools for data visualization. Plots can be displayed on screen and saved in formats such as .pdf, .png, .jpg, .wmf, and .ps. They can also be customized to suit your needs and copied and pasted into Word or PowerPoint documents.

We can create a bar chart, pie chart, histogram, kernel density plot, line chart, boxplot, heat map, and word cloud.

Consider boxplots in R.

Boxplots are also known as box-and-whisker plots. They summarize the distribution of the data using the following five values:

  • Minimum
  • First quartile
  • Median
  • Third quartile
  • Maximum

To make a boxplot, call boxplot(data).

The bar at the end of the lower whisker represents the minimum value, while the bar at the end of the upper whisker represents the maximum value, in both cases excluding outliers. The bottom and top edges of the box mark the first and third quartiles, a bold line inside the box indicates the median, and dots beyond the whiskers indicate outliers.
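To see the numbers behind a boxplot, the sketch below uses the `dist` column of the built-in `cars` dataset, chosen here only as a convenient example. Calling `boxplot` with `plot = FALSE` returns the statistics without drawing, and the plot itself is written to a temporary PDF so the example also runs outside an interactive session.

```r
# Compute the boxplot statistics without drawing anything.
bp <- boxplot(cars$dist, plot = FALSE)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # points beyond the whiskers, drawn as individual dots

# Draw the plot into a temporary PDF file.
pdf_path <- file.path(tempdir(), "dist_boxplot.pdf")
pdf(pdf_path)
boxplot(cars$dist, main = "Stopping distance", ylab = "dist")
dev.off()
```

By default the whiskers stop at the most extreme data point within 1.5 times the interquartile range of the box, so the whisker ends are the minimum and maximum only when there are no outliers.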

Now that you’ve learned more about data visualization in R, let’s dive into the various stages of the data science life cycle.

Data Science Life Cycle

The steps of a typical data science life cycle are as follows:

  1. Data Acquisition: The first phase of any data science project is gathering the data needed to answer the business question. The data may come from many internal and external sources, including web server logs, social media, online repositories, and databases.
  2. Data Preparation: Data preparation, often known as data cleansing or data wrangling, is an essential phase in the life cycle. Data received from multiple sources is usually messy and often has missing values. As a result, it is crucial to clean the data before trying to extract value from it.
  3. Data Exploration: After cleaning the data, you can test hypotheses and visualize the data to understand it better. Data exploration is sometimes loosely referred to as data mining. Statistical analysis is used to uncover patterns in the data set and identify its most important features.
  4. Predictive Modeling: To generate predictions, you must build and train a predictive model. First select an appropriate algorithm. Then divide the historical data into training and validation sets. The model is trained on the training set and checked against the validation set, and its accuracy and efficiency are then assessed.
  5. Model interpretation and deployment: After thoroughly evaluating the model, deploy it to a production-like environment for final user acceptance testing. You should also be able to present the model to a non-technical audience and convey the actionable findings from the data.
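The predictive modeling step can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the built-in `cars` dataset as the "historical data", a 70/30 split, a linear model as the chosen algorithm, and root mean squared error as the accuracy measure. None of these choices are prescribed by the life cycle itself.

```r
set.seed(42)   # arbitrary seed so the split is reproducible

# Hold out roughly 30% of the rows for validation (the ratio is an assumption).
n <- nrow(cars)
train_idx <- sample(n, size = round(0.7 * n))
train <- cars[train_idx, ]
valid <- cars[-train_idx, ]

# Train on the training set only.
model <- lm(dist ~ speed, data = train)

# Assess the model on unseen rows with root mean squared error.
pred <- predict(model, newdata = valid)
rmse <- sqrt(mean((valid$dist - pred)^2))
rmse
```

Keeping the validation rows out of the fitting step is what makes the error estimate meaningful: the model is judged on data it has not seen.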

Now that we’ve covered the various data science life cycle stages, let’s look at some data science algorithms that may assist you in solving complicated business challenges.

Linear Regression with R

Linear regression is a statistical approach for modeling the relationship between one or more independent variables and a dependent variable. It predicts the value of a continuous (numerical) variable. It is commonly used in stock market research, weather forecasting, and sales forecasting.

Linear regression is used for two purposes:

  • Measure the relationship between two variables. For instance, does body weight affect blood cholesterol levels? Does the size of a house affect its price?
  • Forecast the value of the dependent variable based on the independent variables.

The following formula shows the simplest form of a linear regression equation, with one dependent and one independent variable:

y=m*x+c

Where y is the dependent variable, x is the independent variable, m is the slope of the line, and c is its intercept.
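As a small worked example, the slope and intercept can be estimated by least squares using the standard formulas m = cov(x, y) / var(x) and c = mean(y) - m * mean(x). The data below is hypothetical and lies exactly on a line, so the estimates should recover the slope and intercept used to generate it.

```r
# Hypothetical data that lies exactly on the line y = 2 * x + 1.
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1

# Least-squares estimates of the slope m and the intercept.
m <- cov(x, y) / var(x)
c0 <- mean(y) - m * mean(x)   # named c0 to avoid confusion with R's c() function

m    # slope
c0   # intercept
```

With real, noisy data the points will not fall exactly on the line, and the same formulas give the line that minimizes the sum of squared vertical distances.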

There are two types of Linear Regression:

  1. Simple linear regression: A regression model that uses a straight line to estimate the relationship between one independent variable and one dependent variable. Both variables should be numerical.
  2. Multiple linear regression: Often known simply as multiple regression, this is a statistical approach that predicts the value of a response variable using several explanatory variables. It is an extension of simple linear regression, which employs only one explanatory variable.
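In R, both types can be fitted with the `lm` function; the only difference is the number of terms on the right-hand side of the formula. The sketch below uses the built-in `mtcars` dataset, and the choice of `mpg`, `wt`, and `hp` as variables is an assumption made for illustration.

```r
# Simple linear regression: one explanatory variable.
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: several explanatory variables.
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(simple_fit)   # intercept and one slope
coef(multi_fit)    # intercept and one slope per explanatory variable

# R-squared never decreases when a variable is added to the model.
summary(simple_fit)$r.squared
summary(multi_fit)$r.squared
```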

Linear Regression Analysis in R 

We’ll use R’s built-in cars dataset to explore the relationship between its variables in this analysis.

  • head(cars) – Shows the first six rows of the data frame.
  • str(cars) – Displays the data frame’s structure (50 observations of two variables).
  • plot(cars) – Displays a scatter plot of speed vs. distance.
  • plot(cars$dist, cars$speed) – Draws a second scatter plot with the axes swapped, with distance on the x-axis and speed on the y-axis.

Correlation analysis examines the relationship between two continuous variables. It begins by calculating the correlation coefficient between them.

If the value of one variable regularly grows when the value of the other increases, they have a high positive correlation (value near +1).
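A minimal sketch of this analysis on the `cars` dataset might look as follows. It computes the Pearson correlation coefficient with `cor`, fits the line with `lm`, and writes a scatter plot with the fitted line to a temporary PDF file. The file name is a hypothetical choice.

```r
str(cars)   # 50 observations of speed and dist

# Pearson correlation coefficient between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r

# Fit dist = m * speed + c and read off the coefficients.
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# Scatter plot with the fitted line, written to a temporary PDF file.
pdf_path <- file.path(tempdir(), "cars_fit.pdf")
pdf(pdf_path)
plot(cars, main = "Speed vs. stopping distance")
abline(fit)
dev.off()
```

A coefficient close to +1 indicates a strong positive linear relationship, which is what justifies fitting a straight line to this data.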

Now that we’ve seen how the linear regression technique works in R, let’s look at decision trees.

Decision Trees

Decision trees are tree-shaped models used to decide on a course of action. Each branch of the tree represents a possible choice, event, or outcome. The following terms describe the parts of a decision tree:

  1. Root Node: The root node represents the entire population or sample, which is then split into two or more homogeneous sets.
  2. Splitting: The division of a node into two or more sub-nodes.
  3. Decision Node: When a sub-node divides into other sub-nodes, it is referred to as a decision node.
  4. Leaf/terminal Node: Nodes with no offspring (no further split) are called leaf or terminal nodes.
  5. Pruning: Pruning is the technique of reducing the size of decision trees by node reduction (the reverse of splitting).
  6. Branch/sub-tree: A branch or sub-tree is a subset of the decision tree.
  7. Parent and child node: A node that is split into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.

Before developing a decision tree method, you should understand two more concepts: entropy and information gain.

Entropy is a measure of the dataset’s unpredictability or impurity. The decrease in entropy after splitting the dataset is measured as information gain. It is sometimes referred to as entropy reduction.
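Both quantities are easy to compute by hand. The sketch below defines two hypothetical helpers, `entropy` and `info_gain`, using Shannon entropy in bits, and applies them to the built-in `iris` dataset. The split threshold of 2.45 on `Petal.Length` is an illustrative choice, not something taken from the text.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]   # drop empty classes so 0 * log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a logical condition:
# parent entropy minus the size-weighted entropy of the two children.
info_gain <- function(labels, condition) {
  n <- length(labels)
  left <- labels[condition]
  right <- labels[!condition]
  weighted <- (length(left) / n) * entropy(left) +
    (length(right) / n) * entropy(right)
  entropy(labels) - weighted
}

entropy(iris$Species)                               # three equally sized classes
info_gain(iris$Species, iris$Petal.Length < 2.45)   # gain from one candidate split
```

A decision tree algorithm based on information gain evaluates many candidate splits in this way and picks the one with the largest gain at each node.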

Conclusion

You now have a better understanding of how data science works and why it is valuable. You saw how to install R and RStudio and explored many of R’s capabilities. You also learned about the various data structures in R. Finally, you learned how linear regression works in R and covered the core concepts behind decision trees.


GoLogica Technologies Private Limited. All rights reserved 2024.