Course code STIN300

STIN300 Statistical Programming in R

There may be changes to the course due to corona restrictions. See Canvas and StudentWeb for info.

Showing course contents for the educational year 2021-2022.

Course responsible: Jon Olav Vik
Teachers: Kathrine Frey Frøslie, Torgeir Rhodén Hvidsten
ECTS credits: 5
Faculty: Faculty of Chemistry, Biotechnology and Food Science
Teaching language: EN, NO
(NO = Norwegian, EN = English)
Limits of class size:
150
Teaching exam periods:
This course starts in the January block and has teaching/evaluation in the January block.
Course frequency: Annually
First time: 2010H
Course contents:

In this intensive course you will use the R programming language to apply your statistical skills to scientific data. You should have prior programming experience or else be prepared to put in a lot of effort, see "recommended prerequisites".

Most course participants are MSc or PhD students and thus have chosen a research topic. Use your actual data if you can, or else ask your supervisor for an illustrative sample of similar data. If you haven't chosen a research topic yet, you may borrow someone else's data, or fall back on the previous years' final assignment on classifying influenza viruses based on their nucleotide sequences.

You will write an R markdown report which covers the following:

  • Introduction: Describe the real-world phenomenon that you study and concisely state your research question. Briefly describe the origin of your data, making clear how the measurements relate to your real-world topic.
  • Data import: Use R to get your data from the file(s) into R data structures. Convert data types if necessary, so that numbers are not misinterpreted as text, categorical variables are coded as R "factors", TRUE/FALSE values are represented as such, etc.
  • Outline of data structure: Use R to explore and describe the size and structure of your data (number of variables, number of samples, data types, what possible values the categorical variables can take, etc.).
  • Data visualization: Design and implement at least one data graphic that provides a useful overview of your data or answers a research question. Describe in words what the data shows and interpret what it means.
  • Statistical analysis: Choose a suitable statistical model or procedure that clarifies some pattern or relationship relevant to your research question. Implement it using R. Translate the results back to real-world terms. Discuss what the results mean.
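
As a rough illustration of the import, structure, visualization and analysis steps listed above, here is a minimal sketch in base R. The file, the column names (sample_id, treatment, weight_g, infected) and the values are made up for the example and stand in for your own data. In an R markdown report, each step would typically sit in its own code chunk with explanatory text around it.

```r
# Hypothetical example data written to a temporary CSV file,
# standing in for your own data file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample_id,treatment,weight_g,infected",
  "s1,control,12.1,FALSE",
  "s2,control,11.4,FALSE",
  "s3,drug,14.8,TRUE",
  "s4,drug,15.3,FALSE",
  "s5,drug,13.9,TRUE",
  "s6,control,12.7,FALSE"
), csv_path)

# Data import: read the file, then make the data types explicit.
dat <- read.csv(csv_path, stringsAsFactors = FALSE)
dat$treatment <- factor(dat$treatment)    # categorical variable as a factor
dat$infected  <- as.logical(dat$infected) # TRUE/FALSE values as logical

# Outline of data structure: size, types, and levels of the factor.
dim(dat)
str(dat)
levels(dat$treatment)

# Data visualization: weight by treatment group.
boxplot(weight_g ~ treatment, data = dat,
        xlab = "Treatment", ylab = "Weight (g)")

# Statistical analysis: linear model comparing the two groups.
fit <- lm(weight_g ~ treatment, data = dat)
summary(fit)

# The coefficient "treatmentdrug" is the estimated difference in mean
# weight (drug minus control), in grams.
coef(fit)
```

The last step is where the translation back to real-world terms happens: the report should say what a difference of that size means for the phenomenon you study, not only quote the coefficient.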

The resulting report will be fully reproducible with executable code. It should be a helpful starting point for your future work, and will facilitate discussions with your supervisor.

Think of all course activities as leading towards this final assignment. You can learn from free online textbooks, daily tutorial documents with some screencasts, and by asking good questions in Discussions on the Canvas course page.

The tutorial documents comprise an introduction to R scripting, with focus on the use of the tidyverse packages ggplot2 and dplyr. Emphasis is on visualization and structuring and manipulation of data in a table format. Later, we visit topics like operators, variables, data types and basic data structures, control structures (loops, conditionals), more general handling of files and text, and user-defined functions.
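
To give a flavour of the table manipulation and visualization mentioned above, here is a small sketch using dplyr and ggplot2 on R's built-in mtcars data set. It assumes both packages are installed; the choice of data set and variables is only for illustration and is not taken from the course tutorials.

```r
library(dplyr)
library(ggplot2)

# Data manipulation with dplyr: one summary row per cylinder class.
mpg_by_cyl <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg))
mpg_by_cyl

# Visualization with ggplot2: fuel economy against weight,
# coloured by number of cylinders.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
print(p)
```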

Participants are expected to contribute actively to the learning collective, both in posting questions/topics in Canvas Discussions and in plenary class sessions. Asking effective questions with reproducible examples is a key skill which you will learn throughout the course.
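
One possible shape for a reproducible example in a Discussion post is sketched below. The scenario (a numeric variable that should have been treated as a grouping variable) is invented for illustration; the point is that the post carries its own small data set, the shortest code that shows the problem, and a statement of what you expected.

```r
# A small subset of a built-in data set, so anyone can run this
# without access to your files.
dat <- head(mtcars[, c("mpg", "cyl")], 6)

# dput() prints R code that recreates the object exactly;
# paste its output into your question.
dput(dat)

# The shortest code that shows the problem.
# Expected: one coefficient per cylinder group.
# Observed: a single slope, because cyl is stored as a number.
fit_numeric <- lm(mpg ~ cyl, data = dat)
length(coef(fit_numeric))

# For comparison: treating cyl as a factor gives one coefficient per group.
fit_factor <- lm(mpg ~ factor(cyl), data = dat)
length(coef(fit_factor))

# R version and loaded packages, which helpers often ask for.
sessionInfo()
```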

Learning outcome:

Upon completion of the course the students should be capable of performing statistical analyses using a programming approach in R. The students should be able to visualize and manipulate data and make their own functions utilizing/modifying available functions in order to solve specific statistical problems. The students should also be able to present the output from statistical analyses in an accessible and scientific form using text and graphics.

KNOWLEDGE: Students will acquire

  • an understanding of how programming can automate demanding statistical computations.
  • a working knowledge of concepts, syntax and conventions for describing, fitting and interpreting statistical models in R.

SKILLS: Students will be able to

  • interpret output from R's functions for statistical modelling, such as lm().
  • read in data from various file formats including Excel, comma-separated text, and FASTA.
  • develop their own functions which use existing functions, to solve nontrivial challenges more efficiently than by nonstructured programming.
  • present results of statistical analysis in a scientific, clear form through reproducible, executable reports which weave together expository text, program code, and output such as tables and graphics.
  • troubleshoot problems by locating errors, reproducing them on a small subset of the data, stepping through code line by line, etc.
  • orient themselves in documentation for R packages that implement statistical methods the student knows.
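
The skills above about interpreting lm() output and building your own functions on top of existing ones could look roughly like the following sketch. The helper slope_summary() is hypothetical, written for this example, and the built-in mtcars data set stands in for real research data.

```r
# Hypothetical helper: fit a simple linear regression and return the
# slope with its 95% confidence interval as a one-row data frame.
slope_summary <- function(data, response, predictor) {
  model_formula <- reformulate(predictor, response = response)
  fit <- lm(model_formula, data = data)
  ci <- confint(fit)[predictor, ]
  data.frame(
    response  = response,
    predictor = predictor,
    slope     = unname(coef(fit)[predictor]),
    ci_lower  = unname(ci[1]),
    ci_upper  = unname(ci[2]),
    r_squared = summary(fit)$r.squared
  )
}

# Reuse the same function for several predictors instead of
# copying and editing the same block of code three times.
predictors <- c("wt", "hp", "disp")
results <- do.call(
  rbind,
  lapply(predictors, function(p) slope_summary(mtcars, "mpg", p))
)
results
```

Collecting the numbers in a data frame, rather than reading them off printed summaries, makes it easy to turn them into a table or figure in the report.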

GENERAL COMPETENCES: Students will be well prepared to apply statistical methods in R on datasets they encounter in later studies and working life. This includes loading data into R, transforming it to a structure that the analysis function can use, running analyses with appropriate settings, and interpreting and presenting the results in a form that is useful to the end user.

Learning activities:

Teachers are available in real-time plenary sessions the first half of each day, and students work on their own or in self-organized groups in the afternoon. Divide your attention between planning the report on your own data, and studying tutorials, textbooks and R documentation to acquire the necessary skillsets.

In the first two sessions we'll be "live coding" together, to get everyone up to speed in the RStudio programming environment. Then the course "flips", and the live sessions will focus on questions or topics brought up by students in the form of Discussion posts in Canvas.

You will get advice on how to formulate effective questions, which is a key skill because it 1) helps others help you and 2) helps you help yourself. This will be a recurring theme throughout the course.

Teaching support:

The Canvas course pages link to daily tutorial documents, various howtos and free online textbooks.

Most R functions have extensive documentation and runnable examples. You will learn to navigate the R help system, walk yourself through the examples, and relate them to your own problems.
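
A few base R entry points into the help system are sketched below, using lm() as the example topic. This is only a small selection of the available tools, chosen for illustration.

```r
# Open the help page for a function (same as typing ?lm at the console).
help("lm")

# Show a function's arguments and their default values.
lm_args <- args(lm)
lm_args

# Run the examples from the bottom of the help page.
example(lm, ask = FALSE)

# List objects on the search path whose names match a pattern,
# here the functions starting with "read." for importing data.
read_functions <- apropos("^read\\.")
read_functions
```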

Online forums such as Stack Overflow are a rich source of support. You will learn to search existing answers, and how to describe problems clearly enough that others can help.

Ask questions in Discussions in Canvas. They will be answered, either there or in plenary discussion.

Teachers are available in the plenary sessions every day until noon.

Topics raised in Discussion posts in Canvas will be addressed in next day's plenary session for discussion and reflection. Students are expected to participate actively, reflecting on the problem-solving process as well as helping each other out.

Syllabus:

There is no fixed syllabus; what matters is that you achieve the learning objectives in relation to data from your own research field. Whether you learn from daily tutorials, textbooks or self-study of help pages is up to you. The following free online textbooks are all useful reference material:

  • Hands-on programming with R, https://rstudio-education.github.io/hopr (essential, particularly if you have little programming experience)
  • R for data science, https://r4ds.had.co.nz (essential for its coverage of the "tidyverse" packages)
  • ggplot2: Elegant Graphics for Data Analysis, https://ggplot2-book.org/ (covers the main data visualization package used in the course)
  • Fundamentals of data visualization, https://clauswilke.com/dataviz/ (focuses on concepts rather than a particular software package)
  • An introduction to statistical learning, https://statlearning.com/ (useful reference on statistical methods)

Paper copies of the above books can be bought on Amazon.

Prerequisites:
Statistics equivalent to STAT100 (https://www.nmbu.no/course/STAT100).
Recommended prerequisites:

Introduction to programming, e.g. STIN100 Biological data analysis or INF120 Programming and data processing.

Statistics beyond introductory level. STIN300 does not primarily teach statistics, but advises you on applying statistical methods to your own data using the R toolset.

Assessment:
Pass/fail based on quizzes and the data analysis report on data from your own field of research, all of which must be approved. Approved quizzes are valid only within the current semester.
Nominal workload:
Lectures/exercises 60 hours. Individual studies 65 hours.
Entrance requirements:
Special requirements in Science
Type of course:
Lectures/interactive computer lab 4 hours daily for three weeks.
Note:
Students must bring their own laptop with Windows, Linux or macOS 10.13 or higher to run the computer programs we use. (See current system requirements.)
Examiner:
An external examiner must approve the evaluation arrangements for the course.
Examination details: Portfolio: Passed / Failed