STIN300 Statistical Programming in R

Credits (ECTS): 5

Course responsible: Jon Olav Vik

Campus / Online: Taught on campus, Ås

Teaching language: English, Norwegian

Limits of class size: 150

Course frequency: Annually

Nominal workload: Lectures/exercises 60 hours. Individual studies 65 hours.

Teaching and exam period: This course starts in the January block and has teaching/evaluation in the January block.

About this course

In this intensive course you will use the R programming language to apply your statistical skills to scientific data. You should have prior programming experience, or else be prepared to put in a lot of effort; see "recommended prerequisites".

Most course participants are MSc or PhD students and thus have chosen a research topic. Use your actual data if you can, or else ask your supervisor for an illustrative sample of similar data. If you haven't chosen a research topic yet, you may borrow someone else's data, or fall back on the previous years' final assignment on classifying influenza viruses based on their nucleotide sequences.

You will write an R markdown report which covers the following:

  • Introduction: Describe the real-world phenomenon that you study and concisely state your research question. Briefly describe the origin of your data, making clear how the measurements relate to your real-world topic.
  • Data import: Use R to get your data from the file(s) into R data structures. Convert data types if necessary, so that numbers are not misinterpreted as text, categorical variables are coded as R "factors", TRUE/FALSE values are represented as such, etc.
  • Outline of data structure: Use R to explore and describe the size and structure of your data (number of variables, number of samples, data types, what possible values the categorical variables can take, etc.).
  • Data visualization: Design and implement at least one data graphic that provides a useful overview of your data or answers a research question. Describe in words what the data shows and interpret what it means.
  • Statistical analysis: Choose a suitable statistical model or procedure that clarifies some pattern or relationship relevant to your research question. Implement it using R. Translate the results back to real-world terms. Discuss what the results mean.
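
The import, structure, visualization and analysis steps above could be sketched roughly as follows. This is only an illustration in base R, not the course's prescribed workflow: the file contents, the column names (plot_id, treatment, height_cm, flowering) and the choice of lm() are made-up assumptions, and your own data will call for different conversions and models.

```r
# Hypothetical example data, written to a temporary CSV file so that the
# sketch is self-contained. Replace this with the path to your own file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "plot_id,treatment,height_cm,flowering",
  "1,control,12.1,FALSE",
  "2,control,13.4,TRUE",
  "3,fertilized,15.8,TRUE",
  "4,fertilized,16.9,TRUE",
  "5,control,11.7,FALSE",
  "6,fertilized,17.2,TRUE"
), csv_path)

# Data import: read every column as text, then convert each type explicitly
# so that numbers, categories and TRUE/FALSE values are stored as such.
raw <- read.csv(csv_path, colClasses = "character")
plants <- data.frame(
  plot_id   = as.integer(raw$plot_id),
  treatment = factor(raw$treatment, levels = c("control", "fertilized")),
  height_cm = as.numeric(raw$height_cm),
  flowering = as.logical(raw$flowering)
)

# Outline of data structure: size, data types, and levels of the factor.
str(plants)
dim(plants)
levels(plants$treatment)
table(plants$treatment)

# Data visualization: one boxplot of height per treatment group.
# It is written to a temporary PDF here so the script also runs
# non-interactively; in an R Markdown chunk the boxplot() call is enough.
pdf(file.path(tempdir(), "height_by_treatment.pdf"))
boxplot(height_cm ~ treatment, data = plants, ylab = "Height (cm)")
dev.off()

# Statistical analysis: a linear model comparing the two groups.
# The coefficient "treatmentfertilized" is the estimated difference in
# mean height between fertilized and control plots.
fit <- lm(height_cm ~ treatment, data = plants)
summary(fit)
```

In a real report, each step would sit in its own R Markdown code chunk, surrounded by text that explains the choice and interprets the output in real-world terms.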

The resulting report will be fully reproducible with executable code. It should be a helpful starting point for your future work, and will facilitate discussions with your supervisor.

Think of all course activities as leading towards this final assignment. You can learn from free online textbooks, daily tutorial documents with some screencasts, and by asking good questions in Discussions on the Canvas course page.

The tutorial documents comprise an introduction to R scripting, with a focus on the tidyverse packages ggplot2 and dplyr. The emphasis is on visualizing, structuring and manipulating data in table format. Later, we visit topics like operators, variables, data types and basic data structures, control structures (loops, conditionals), more general handling of files and text, and user-defined functions.

Participants are expected to contribute actively to the learning collective, both in posting questions/topics in Canvas Discussions and in plenary class sessions. Asking effective questions with reproducible examples is a key skill which you will learn throughout the course.

Learning outcome

Upon completion of the course the students should be capable of performing statistical analyses using a programming approach in R. The students should be able to visualize and manipulate data and make their own functions utilizing/modifying available functions in order to solve specific statistical problems. The students should also be able to present the output from statistical analyses in an accessible and scientific form using text and graphics.

KNOWLEDGE: Students will acquire

  • an understanding of how programming can automate demanding statistical computations.
  • a working knowledge of concepts, syntax and conventions for describing, fitting and interpreting statistical models in R.

SKILLS: Students will be able to

  • interpret output from R's functions for statistical modelling, such as lm().
  • read in data from various file formats including Excel, comma-separated text, and FASTA.
  • develop their own functions which use existing functions, to solve nontrivial challenges more efficiently than by nonstructured programming.
  • present results of statistical analysis in a scientific, clear form through reproducible, executable reports which weave together expository text, program code, and output such as tables and graphics.
  • troubleshoot problems by locating errors, reproducing them on a small subset of the data, stepping through code line by line, etc.
  • orient themselves in documentation for R packages that implement statistical methods the student knows.
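
As an illustration of two of the skills listed above (interpreting lm() output and writing your own functions on top of existing ones), here is a minimal sketch. The helper name slope_summary and its arguments are hypothetical and not part of the course material. It assumes a numeric predictor and uses the built-in mtcars dataset.

```r
# Hypothetical helper: fit a simple linear regression of `response` on a
# single numeric `predictor` and return the slope, its 95% confidence
# interval and its p-value as a one-row data frame.
slope_summary <- function(data, response, predictor) {
  model_formula <- reformulate(predictor, response = response)
  fit <- lm(model_formula, data = data)
  ci <- confint(fit)[predictor, ]  # lower and upper 95% limits
  data.frame(
    response  = response,
    predictor = predictor,
    slope     = unname(coef(fit)[predictor]),
    ci_lower  = unname(ci[1]),
    ci_upper  = unname(ci[2]),
    p_value   = summary(fit)$coefficients[predictor, "Pr(>|t|)"]
  )
}

# Reuse the same function for several predictors instead of copying and
# pasting the model-fitting code three times.
predictors <- c("wt", "hp", "disp")
results <- do.call(
  rbind,
  lapply(predictors, function(p) slope_summary(mtcars, "mpg", p))
)
print(results)
```

Collecting the numbers into a data frame makes them easy to present as a table in a report. The full summary() output is still the place to check model assumptions and overall fit.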

GENERAL COMPETENCES: Students will be well prepared to apply statistical methods in R on datasets they encounter in later studies and working life. This includes loading data into R, transforming it to a structure that the analysis function can use, running analyses with appropriate settings, and interpreting and presenting the results in a form that is useful to the end user.

  • Teachers are available in real-time plenary sessions the first half of each day, and students work on their own or in self-organized groups in the afternoon. Divide your attention between planning the report on your own data, and studying tutorials, textbooks and R documentation to acquire the necessary skillsets.

    In the first two sessions we'll be "live coding" together, to get everyone up to speed in the RStudio programming environment. Then the course "flips", and the live sessions will focus on questions or topics brought up by students in the form of Discussion posts in Canvas.

    You will get advice on how to formulate effective questions, which is a key skill because it 1) helps others help you and 2) helps you help you. This will be a recurring theme throughout the course.

  • The Canvas course pages link to daily tutorial documents, various howtos and free online textbooks.

    Most R functions have extensive documentation and runnable examples. You will learn to navigate the R help system, walk yourself through the examples, and relate them to your own problems.

    Online forums such as Stack Overflow are a rich source of support. You will learn to search existing answers, and how to describe problems clearly enough that others can help.

    Ask questions in Discussions in Canvas. They will be answered, either there or in plenary discussion.

    Teachers are available in the plenary sessions every day until noon.

    Topics raised in Discussion posts in Canvas will be addressed in the next day's plenary session for discussion and reflection. Students are expected to participate actively, reflecting on the problem-solving process as well as helping each other out.

  • Statistics knowledge equivalent to a grade of C in STAT100. You are expected to be acquainted with simple linear regression and analysis of variance.

    You are expected to be familiar with your file system, your keyboard, your web browser and your computer.

  • Pass/fail based on quizzes and the data analysis report on data from your own field of research, all of which must be approved. Approved quizzes are valid only within the current semester.
  • An external examiner must approve the evaluation arrangements for the course.
  • Students must bring their own laptop with Windows, Linux or macOS 11 or higher to run the computer programs we use. (See current system requirements.) Chromebooks cannot be used because they do not meet the system requirements.
  • Lectures/interactive computer lab 4 hours daily for three weeks.
  • M-BIAS
  • Passed / Not Passed
  • Special requirements in Science