
The R programming language: its place among statistical programs, and what the R environment looks like

Let's talk a little about a programming language called R. You may recently have read articles on our blog about areas where you simply need a powerful language at hand for working with statistics and graphics, and R is one of them. It might be hard for a newcomer to the programming world to believe, but today R is already more popular than SQL: it is actively used in commercial organizations, research, and universities.

Without getting into the rules, syntax, and specific uses, let's just go through the essential books and resources that will help you learn R.

What the R language is, why you need it, and how to use it wisely, you can learn from the excellent talk that Ruslan Kuptsov gave a little less than a year ago as part of GeekWeek-2015.

Books

Now that there is some order in your head, you can start on the literature, since there is more than enough of it. Let's begin with domestic authors:


Internet resources

Anyone who wants to learn a programming language should visit two resources in search of knowledge: the official website of its developers and the largest online community. We will not make an exception for R either:

And again, out of care for those who have not yet managed to learn English but really want to study R, we will mention several Russian-language resources:

In the meantime, let's complete the picture with a small list of English-language, but no less informative sites:

CRAN is the place where you can download the R environment to your computer, along with manuals, examples, and other useful reading material;

Quick-R - briefly and clearly about statistics, methods of its processing and the R language;

Burns-Stat - about R and its predecessor S, with a huge number of examples;

R for Data Science, another book from Garrett Grolemund, translated into an online textbook;

Awesome R - a curated collection of the best R packages and tools, hosted on our beloved GitHub;

MRAN - the R distribution from Microsoft;

R Tutorial - another resource with well-organized information from the official site.

The following topic prompted me to write this article: "In search of the perfect post, or the riddle of Habr". The fact is that after getting acquainted with the R language, I look extremely askance at any attempt to calculate something in Excel. But I must admit that I met R only a week ago.

Purpose: collect data from our favorite Habrahabr using the R language and carry out what the R language was created for, namely statistical analysis.

So, after reading this topic, you will find out:

  • How you can use R to retrieve data from web resources
  • How to transform data for later analysis
  • What resources are highly recommended for everyone who wants to get to know R better

The reader is expected to be independent enough to familiarize themselves with the basic constructs of the language; the links at the end of the article are better suited for that.

Preparation

We will need the following resources:

After installation, you should see something like this:

In the bottom right pane, under the Packages tab, you can find a list of installed packages. We need to additionally install the following:

  • RCurl - for networking; anyone who has worked with cURL will immediately understand the possibilities it opens up.
  • XML - a package for working with the DOM tree of an XML document; we need its functionality for finding elements by XPath.
Click "Install Packages", select the ones you need, and then tick them so that they are loaded into the current environment.
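If you prefer the console to the GUI, the same setup can be done with two commands (a sketch; this installs the RCurl and XML packages used below):

```r
# One-time installation of the packages used in this article
install.packages(c("RCurl", "XML"))

# Load them into the current session
library(RCurl)
library(XML)
```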

We receive data

To get a DOM object of a document obtained from the Internet, it is enough to execute the following lines:
url <- "http://habrahabr.ru/feed/posts/habred/page10/"
cookie <- "My top-secret cookies"
html <- getURL(url, cookie=cookie)
doc <- htmlParse(html)
Pay attention to the transmitted cookies. If you want to repeat the experiment, then you will need to substitute your cookies, which your browser receives after authorizing on the site. Next, we need to get the data we are interested in, namely:
  • When the post was posted
  • How many views were there
  • How many people have added an entry to favorites
  • How many clicks were +1 and -1 (in total)
  • How many were +1 clicks
  • How much -1
  • Current rating
  • Number of comments
Without going into too much detail, I will immediately give the code:
published <- xpathSApply(doc, "//div[@class='published']", xmlValue)
pageviews <- xpathSApply(doc, "//div[@class='pageviews']", xmlValue)
favs <- xpathSApply(doc, "//div[@class='favs_count']", xmlValue)
scoredetailes <- xpathSApply(doc, "//span[@class='score']", xmlGetAttr, "title")
scores <- xpathSApply(doc, "//span[@class='score']", xmlValue)
comments <- xpathSApply(doc, "//span[@class='all']", xmlValue)
hrefs <- xpathSApply(doc, "//a[@class='post_title']", xmlGetAttr, "href")
Here we used XPath to find elements and attributes.
Next, it is highly recommended to put the received data into a data.frame - an analogue of a database table. You can then run queries of varying complexity against it. Sometimes you wonder how elegantly things can be done in R.
posts<-data.frame(hrefs, published, scoredetailes, scores, pageviews, favs, comments)
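To illustrate what such "queries" look like, here is a self-contained sketch on a made-up frame with similar columns (the values are invented):

```r
# Synthetic stand-in for the scraped posts frame (values are made up)
demo <- data.frame(
  hrefs     = c("post-a", "post-b", "post-c"),
  scores    = c(15, -3, 42),
  pageviews = c(1000, 500, 9000),
  stringsAsFactors = FALSE
)

# "Query": posts with a positive score, sorted by page views, descending
top <- demo[demo$scores > 0, ]
top <- top[order(-top$pageviews), ]
top$hrefs  # "post-c" "post-a"
```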
After forming the data.frame, we need to clean up the received data: convert strings to numbers, turn dates into a normal format, and so on. We do it this way:

posts$comments <- as.numeric(as.character(posts$comments))
posts$scores <- as.numeric(as.character(posts$scores))
posts$favs <- as.numeric(as.character(posts$favs))
posts$pageviews <- as.numeric(as.character(posts$pageviews))
posts$published <- sub(" декабря в ", "/12/2012 ", as.character(posts$published))
posts$published <- sub(" ноября в ", "/11/2012 ", posts$published)
posts$published <- sub(" октября в ", "/10/2012 ", posts$published)
posts$published <- sub(" сентября в ", "/09/2012 ", posts$published)
posts$published <- sub("^ ", "", posts$published)
posts$publishedDate <- as.Date(posts$published, format="%d/%m/%Y %H:%M")

It is also useful to add additional fields that are calculated from those already received:
scoressplitted <- sapply(strsplit(as.character(posts$scoredetailes), "\\D+", perl=TRUE), unlist)
if (class(scoressplitted) == "matrix" && nrow(scoressplitted) == 4) {
  scoressplitted <- t(scoressplitted)
  posts$actions <- as.numeric(as.character(scoressplitted[, 1]))
  posts$plusactions <- as.numeric(as.character(scoressplitted[, 2]))
  posts$minusactions <- as.numeric(as.character(scoressplitted[, 3]))
}
posts$weekDay <- format(posts$publishedDate, "%A")
Here we transformed the familiar messages of the form "Total 35: ↑29 and ↓6" into numeric columns: how many votes were cast in total, how many were pluses, and how many were minuses.

With this, all the data has been received and converted into a format ready for analysis. I packaged the code above as a ready-to-use function; a link to the source can be found at the end of the article.

But the attentive reader has already noticed that this way we received data for only one page. To get data for a whole list of pages, the following function was written:

getPostsForPages <- function(pages, cookie, sleep=0) {
  urls <- paste("http://habrahabr.ru/feed/posts/habred/page", pages, "/", sep="")
  ret <- data.frame()
  for (url in urls) {
    ret <- rbind(ret, getPosts(url, cookie))
    Sys.sleep(sleep)
  }
  return(ret)
}
Here we use the built-in function Sys.sleep, so as not to accidentally inflict the habr effect on Habr itself :)
This function is proposed to be used as follows:
posts <- getPostsForPages(10:100, cookie, 5)
Thus, we download all pages from 10 to 100 with a pause of 5 seconds. Pages before page 10 are not interesting to us, since their scores are not yet visible. After a few minutes of waiting, all of our data is in the posts variable. I recommend saving it right away so as not to bother Habr every time! This is done like this:
write.csv(posts, file="posts.csv")
And we read it as follows:
posts<-read.csv("posts.csv")
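One caveat worth knowing: by default, write.csv also stores row names, which come back as an extra unnamed column on re-reading. A self-contained round-trip sketch:

```r
posts <- data.frame(scores = c(1, 2), comments = c(10, 20))

# row.names = FALSE avoids a spurious extra column on re-reading
f <- tempfile(fileext = ".csv")
write.csv(posts, file = f, row.names = FALSE)
restored <- read.csv(f)

names(restored)  # "scores" "comments"
```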

Hooray! We learned how to get statistical data from Habr and save them locally for the next analysis!

Data analysis

I will leave this section mostly open. I suggest the reader play with the data and draw their own conclusions. For example, try to analyze how the mood of pluses and minuses depends on the day of the week. I will give only two interesting conclusions that I made.
Habr users are much more willing to upvote than downvote.
This can be seen in the following graph. Notice how much more uniform and wider the "cloud" of minuses is than the spread of pluses. The correlation of pluses with views is much stronger than that of minuses. In other words: we upvote without thinking, but we downvote for a reason!
(I apologize for the inscriptions on the charts: I have not yet figured out how to display them correctly in Russian)
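The correlation claim is easy to check with cor; here is a sketch on synthetic data (the real columns would be plusactions, minusactions, and pageviews from the frame built earlier; these numbers are made up):

```r
set.seed(42)
views <- runif(200, 100, 10000)
# in this toy model, pluses track views closely, minuses only loosely
plus  <- views / 100 + rnorm(200, sd = 5)
minus <- views / 500 + rnorm(200, sd = 20)

cor(plus, views)   # close to 1
cor(minus, views)  # noticeably weaker
```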

There are indeed several classes of posts
This statement was taken for granted in the post mentioned above, but I wanted to verify it in reality. To do this, it is enough to calculate the average share of pluses in the total number of actions, do the same for minuses, and divide the second by the first. If everything were uniform, we should not observe many local peaks on the histogram, but they are there.
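A sketch of that calculation on synthetic counts (the real frame has the plusactions, minusactions, and actions columns built earlier; the counts here are simulated):

```r
set.seed(1)
# Synthetic stand-in for the posts frame built earlier
demo <- data.frame(plusactions = rpois(500, 30) + 1,
                   minusactions = rpois(500, 6))
demo$actions <- demo$plusactions + demo$minusactions

# share of minuses divided by share of pluses; the /actions factors cancel
ratio <- (demo$minusactions / demo$actions) /
         (demo$plusactions  / demo$actions)
hist(ratio, breaks = 50)  # look for local peaks
```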


As you can see, there are pronounced peaks around 0.1, 0.2, and 0.25. I suggest the reader find and "name" these classes themselves.
I want to note that R is rich in algorithms for clustering data, approximation, hypothesis testing, and so on.
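For instance, a minimal clustering sketch with the built-in kmeans function (on made-up one-dimensional data):

```r
set.seed(7)
# two obvious groups of one-dimensional points
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))
fit <- kmeans(x, centers = 2)

table(fit$cluster)  # roughly a 50/50 split
```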

Useful Resources

If you really want to dive into the world of R, I recommend the following links. Please share your interesting R blogs and sites in the comments section. Is there anyone writing about R in Russian?

I want to talk about using the free statistical analysis environment R. I consider it an alternative to statistical packages like SPSS Statistics. To my deep regret, it is completely unknown in the vastness of our Motherland, and undeservedly so. I believe that the ability to write additional statistical analysis routines in the S language makes R a useful data analysis tool.

In the spring semester of 2010, I had the opportunity to lecture and conduct practical classes in the course "Statistical Data Analysis" for students of the Department of Intelligent Systems of the Russian State University for the Humanities.

My students had previously taken a semester course in probability theory covering the basics of discrete probability spaces, conditional probabilities, Bayes' theorem, the law of large numbers, some facts about the normal distribution, and the Central Limit Theorem.

About five years ago I had already taught the (then still combined) semester course "Foundations of Probability Theory and Mathematical Statistics," so I expanded my statistics notes (handed out to students before each lesson). Now that the department at RSUH has a student server, isdwiki.rsuh.ru, I also upload them there via FTP.

The question arose: which program should I use for the practical exercises in a computer class? The oft-used Microsoft Excel was rejected both for being proprietary and for its incorrect implementation of some statistical procedures. You can read about this, for example, in the book by A.A. Makarov and Yu.N. Tyurin, "Statistical Analysis of Data on a Computer." The Calc spreadsheets from the free office suite OpenOffice.org were Russified to the point that I could hardly find the function I needed (their names were also abbreviated dreadfully).

The most commonly used package is SPSS Statistics, now owned by IBM. Among the advantages of IBM SPSS Statistics I would highlight:

  • Convenient loading of data of various formats (Excel, SAS, via OLE DB, via ODBC Direct Driver);
  • Availability of both command language and branched menu system for direct access to various statistical analysis procedures;
  • Graphic means for displaying results;
  • Built-in Statistics Coach that interactively suggests an adequate analysis method.
The disadvantages of IBM SPSS Statistics in my opinion are:
  • Paid even for students;
  • The need to obtain (additionally paid) modules containing special procedures;
  • Only 32-bit Linux operating systems are supported, although both 32-bit and 64-bit Windows are supported.
As an alternative, I chose R. This system began to be developed through the efforts of Robert Gentleman and Ross Ihaka at the Department of Statistics of the University of Auckland in 1995. The first letters of the authors' names determined its name. Subsequently, leading statisticians joined the development and expansion of the system.

I consider the advantages of the system under discussion:

  • Distribution of the program under the GNU General Public License;
  • Availability of both sources and binaries from the extensive CRAN (the Comprehensive R Archive Network). For Russia, this is the cran.gis-lab.info server;
  • The presence of an installation package for Windows (works on both 32-bit and 64-bit Vista). It turned out by chance that the installation does not require administrator rights under Windows XP;
  • The ability to install from the repository on Linux (works for me on 64-bit version of Ubuntu 9.10);
  • Its own programming language for statistical procedures, R, which has effectively become a standard. It is, for example, fully supported by the new IBM SPSS Statistics Developer;
  • This language is an extension of the S language developed at Bell Labs and currently forms the basis of the commercial S-PLUS system. Most programs written for S-PLUS can be easily executed in the R environment;
  • The ability to exchange data with spreadsheets;
  • The ability to save the entire history of calculations for documentation purposes.
For the first lesson, a CD was prepared containing the installation files, documentation, and manuals. I will say more about the latter. CRAN provides detailed user guides for installation, for R (and its subset S), for writing additional statistical procedures, and for exporting and importing data. The Contributed Documentation section has a large number of publications by statisticians who use this package in their teaching. Unfortunately, there is nothing in Russian, although there is, for example, even Polish. From the English-language books I would mention "Using R for Introductory Statistics" by Professor John Verzani of the City University of New York and "Introduction to the R Project for Statistical Computing" by Professor Rossiter (Netherlands) of the International Institute for Geo-Information Science and Earth Observation.

The first lesson was devoted to installing and learning to use the package and to getting familiar with the syntax of the R language. Computing integrals by the Monte Carlo method served as a test problem. Here is an example of computing the probability that a random variable with an exponential distribution with parameter 3 takes a value less than 0.5 (10000 is the number of trials).
> x = runif(10000, 0, 0.5)
> y = runif(10000, 0, 3)
> t = y < 3*exp(-3*x)
> u = x[t]
> v = y[t]
> plot(u, v)
> i = 0.5 * 3 * length(u) / 10000

The first two lines generate points uniformly distributed in the rectangle [0, 0.5] × [0, 3]; then the points falling under the graph of the exponential density 3*exp(-3*x) are selected; the plot function displays the points in the graphics window; and finally the required integral is estimated as the fraction of accepted points times the area of the rectangle.
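Since R ships the exponential CDF as pexp, the Monte Carlo estimate can be checked against the exact answer; a reproducible sketch of the same computation:

```r
set.seed(1)
n <- 10000
x <- runif(n, 0, 0.5)
y <- runif(n, 0, 3)
accepted <- y < 3 * exp(-3 * x)
i <- 0.5 * 3 * sum(accepted) / n   # rectangle area times acceptance rate

exact <- pexp(0.5, rate = 3)       # 1 - exp(-1.5), about 0.777
abs(i - exact)                      # small
```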
The second lesson was devoted to computing descriptive statistics (quantiles, median, mean, variance, correlation, and covariance) and plotting graphs (histograms, box-and-whisker plots).
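A minimal sketch of such a lesson, on simulated data:

```r
set.seed(123)
data <- rnorm(1000, mean = 10, sd = 2)

mean(data); median(data); var(data)
quantile(data, probs = c(0.25, 0.5, 0.75))

hist(data)     # histogram
boxplot(data)  # box-and-whisker plot
```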
In the following lessons, the Rcmdr library was used. It is a graphical user interface (GUI) for the R environment. The library is being developed by Professor John Fox of McMaster University in Canada.

The library is installed by executing the command install.packages("Rcmdr", dependencies=TRUE) inside the R environment. While R itself is an interpreter of the R language, the Rcmdr add-on is an additional window equipped with a menu system containing a large number of commands corresponding to standard statistical procedures. This is especially convenient for courses whose main goal is to teach the student which buttons to press (unfortunately, there are more and more of these now).

Seminar notes from my previous course have been expanded; they are also available via FTP from isdwiki.rsuh.ru. These notes contained tables of critical values used for calculations at the blackboard. This year, students were asked to solve the same problems on a computer and also to check the tables against the (normal) approximations given in the notes.

There were also some mistakes on my part. For example, I realized too late that Rcmdr allows importing data from downloaded packages, so relatively large samples were processed only in the regression analysis classes. When covering nonparametric tests, students entered the data by hand from my notes. Another drawback, as I now understand, was the insufficient number of homework assignments involving writing fairly complex programs in R.

It should be noted that several senior students attended my classes, and some downloaded materials from lectures and seminars. Students of the Department of Intelligent Systems of the Russian State University for the Humanities receive fundamental training in mathematics and programming, so the use of the R environment (instead of spreadsheets and statistical packages with fixed statistical procedures) seems to me very useful.

If you are faced with the task of studying statistics, and especially writing non-standard procedures for statistical data processing, then I recommend turning your attention to the R package.

Statistical analysis is an integral part of scientific research. High-quality data processing increases the chances of publishing an article in a reputable journal and of bringing research to an international level. There are many programs that provide high-quality analysis, but most of them are paid, with a license often costing several hundred dollars or more. Today, however, we are going to talk about a statistical environment that you do not have to pay for, and whose reliability and popularity rival the best commercial statistical packages: let's get to know R!

What is R?

Before giving a clear definition, it should be noted that R is more than just a program: it is both an environment, and a language, and even movement! We'll look at R from different angles.

R is a computing environment developed by scientists for data processing, mathematical modeling, and graphics. R can be used as a simple calculator; you can perform simple statistical analyses (for example, ANOVA or regression analysis) as well as more complex, time-consuming calculations, test hypotheses, and build vector graphics and maps. This is far from a complete list of what you can do in this environment. It is distributed free of charge and can be installed both on Windows and on UNIX-like operating systems (Linux and Mac OS X). In other words, R is a free and cross-platform product.
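For instance, both kinds of analysis mentioned above fit in a few lines using R's built-in example datasets (cars and PlantGrowth):

```r
# Regression: stopping distance vs speed, built-in "cars" dataset
fit <- lm(dist ~ speed, data = cars)
summary(fit)  # coefficients, R-squared, p-values

# One-way ANOVA: plant weight by treatment group, built-in dataset
a <- aov(weight ~ group, data = PlantGrowth)
summary(a)
```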

R is a programming language with which you can write your own programs (scripts) and use and create specialized extensions (packages). A package is a collection of files with help information and examples, bundled together into a single archive. Packages play an important role, since they extend base R. Each package is usually dedicated to a specific topic; for example, the "ggplot2" package is used to create beautiful vector plots of a particular design, and the "qtl" package is ideal for genetic mapping. There are over 7000 such packages in the R library at the moment! All of them are checked for errors and are publicly available.


R is a community / movement.
Since R is a free open-source product, its development, testing, and debugging are done not by a single company with hired staff, but by the users themselves. Over the past two decades, a huge community has formed around the core developers and enthusiasts. According to recent estimates, more than 2 million people have helped develop and promote R on a voluntary basis in one way or another, from translating documentation and creating training courses to developing new applications for science and industry. There are a huge number of forums on the Internet where you can find answers to most questions related to R.

What does the R environment look like?

There are many "skins" for R, which can vary greatly in appearance and functionality. We will briefly cover only the three most popular options: Rgui, RStudio, and R running as a command line in a Linux/UNIX terminal.


The R language in the world of statistical programs

At the moment there are dozens of high-quality statistical packages, among which the clear leaders are SPSS, SAS, and MatLab. However, in 2013, despite strong competition, R became the most used statistical analysis software in scientific publications (http://r4stats.com/articles/popularity/). In addition, over the last decade R has become more and more in demand in the business sector: giants such as Google, Facebook, Ford, and the New York Times actively use it to collect, analyze, and visualize data (http://www.revolutionanalytics.com/companies-using-r). To understand the reasons for the growing popularity of the R language, let us look at its similarities to and differences from other statistical products.

In general, most statistical tools can be classified into three types:

  1. GUI programs based on the principle "click here and here and get a finished result";
  2. statistical programming languages, which require basic programming skills;
  3. "mixed" tools, which have both a graphical interface (GUI) and the ability to write scripts (for example: SAS, STATA, Rcmdr).

Features of programs with GUI

Programs with a graphical interface look familiar to the average user and are easy to learn, but they are not suitable for solving non-trivial tasks, since they have a limited set of statistical methods and you cannot write your own algorithms in them. The mixed type combines the convenience of a GUI shell with the power of a programming language. However, in a detailed comparison of statistical capabilities, they lose to both R and MatLab (see the comparison of the statistical methods of R, MatLab, STATA, SAS, and SPSS). In addition, licenses for these programs cost a decent amount of money, and the only free alternative is Rcmdr (R Commander), a GUI frontend for R.

Comparison of R with the MatLab, Python, and Julia programming languages

Among the programming languages used for statistical computing, the leading positions are occupied by R and MatLab. They are similar to each other both in appearance and in functionality, but they have different user bases, which determines their specifics. Historically, MatLab has been focused on the applied engineering sciences, so its strengths are mathematical modeling and computation, and it is much faster than R! But since R was developed as a specialized language for statistical data processing, many experimental statistical methods first appeared and took root in it. This fact, together with its zero cost, made R an ideal platform for developing and using new packages for the basic sciences.

Other "competing" languages are Python and Julia. In my opinion, Python, being a general-purpose programming language, is better suited for data processing and for gathering information with web technologies than for statistical analysis and visualization (the main differences between R and Python are well described elsewhere). The statistical language Julia, meanwhile, is a rather young and ambitious project. Its main feature is computational speed, which in some tests exceeds R's by a factor of 100! Julia is still at an early stage of development and has few additional packages and followers, but in the long term it is perhaps the only potential competitor to R.

Conclusion

Thus, R is currently one of the leading statistical tools in the world. It is actively used in genetics, molecular biology and bioinformatics, environmental sciences (ecology, meteorology) and agricultural disciplines. R is also increasingly used in medical data processing, displacing commercial packages such as SAS and SPSS from the market.

Advantages of the R environment:

  • free and cross-platform;
  • a rich arsenal of statistical methods;
  • high-quality vector graphics;
  • more than 7000 vetted packages;
  • flexible to use:
    - allows you to create and edit scripts and packages,
    - interacts with other languages such as C, Java, and Python,
    - can work with data formats from SAS, SPSS, and STATA;
  • an active community of users and developers;
  • regular updates, good documentation, and technical support.

Disadvantages:

  • a small amount of information in Russian (although several training courses and interesting books have appeared over the past five years);
  • relatively difficult for a user unfamiliar with programming languages. This can be partially mitigated by working in the Rcmdr GUI shell mentioned above, but non-standard solutions still require the command line.

List of useful sources

  1. Official website: http://www.r-project.org/
  2. Starter site: http://www.statmethods.net/
  3. One of the best reference books: The R Book, 2nd Edition by Michael J. Crawley, 2012
  4. List of available literature in Russian + good blog

Programming on R. Level 1. Basics

The R language is the world's most popular tool for statistical data analysis. It offers the widest range of capabilities for data analysis and visualization, as well as for creating documents and web applications. Looking to master this powerful language with an experienced mentor? We invite you to the course "Programming in the R Language. Level 1. Basics".

This course is intended for a wide range of professionals who need to look for patterns in large amounts of data, visualize them, and draw statistically sound conclusions: sociologists, clinical trial managers and pharmacologists, researchers (in astronomy, physics, biology, genetics, medicine, etc.), IT analysts, business analysts, financial analysts, and marketers. The course will also appeal to specialists for whom the functionality of their existing (or paid) tools is not enough.

In class, you will gain basic skills in analyzing and visualizing data in the R environment. Most of the time is devoted to hands-on exercises and working with real data sets. You will learn new tools for working with data and how to apply them in your work.

Upon completing the course, the training center issues a certificate of professional development.


