(Also known as: how to get the most from your data without making a fool of yourself.)
Term: September 2011
Lecturer: Scott Oser
Class coordinates: Tuesdays/Thursdays, 9:30-11:00 in Hennings 302
Office Hours: Tuesdays 12-1pm, or by appointment
TA: Shimpei Tobayama
Topics covered: Interpretation of probability; basic descriptive statistics; common probability distributions; Monte Carlo methods; Bayesian analysis; methods of error propagation; systematic uncertainties; parameter estimation; hypothesis testing and statistical significance; confidence intervals; blind analyses; methods of multivariate analysis; nonparametric tests; periodicity searches; "robust" statistics; deconvolution and unfolding
Prerequisites: Officially, none. However, you will be expected to have some facility with computational techniques and programming in a high-level language, or at least a willingness to learn very quickly. Quite simply, it's not possible to do much data analysis or statistics without being able to program. Almost all homework assignments will have a large computational component, although this class will not teach programming per se. If you don't already know basic computational physics, your time might be better spent taking Physics 210 or Physics 410 instead.
Textbooks: There are two textbooks for this course:
Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, by Roger Barlow
Bayesian Logical Data Analysis for the Physical Sciences, by P.C. Gregory
Each has a different focus with different strengths and weaknesses, and I'll draw on material from both.
Supplemental material: You may also find these books enlightening:
Numerical Recipes, by William H. Press et al. (available online)
Statistical Data Analysis, by Glen Cowan
Practical Statistics for Astronomers, by J.V. Wall and C.R. Jenkins
Probability and Statistics, by Morris H. DeGroot
Your grade will be determined by:
Homework: 60%
Midterm: 20%
Final Exam: 20%
Homework: There will be approximately five lengthy homework assignments. They will generally include analytic calculations, essay-type questions, and computational problems that will require you to analyze data sets, usually by writing some computer code. You are welcome to discuss problems informally with your classmates. However, you must complete the assignment yourself, and if you hand in obviously copied homework, you should expect a mark of zero on that assignment, or worse. Assignments are due by the end of class on their due date. I will give more guidelines in class for how to submit completed assignments.
Useful software: This course will require some computational facility on your part. The entire course can be done using free software, and you're not required to buy anything. The most important things you'll need are access to a good plotting package and a library of scientific routines (capable of random number generation, nonlinear fitting, and matrix operations at a minimum). I encourage you to use whatever tools your field uses or that you already know, but if you want some recommendations, you may find the following useful:
ROOT: a combined plotting/analysis package developed by the high energy physics community (but of general utility), based on a C++ interpreter. Extremely powerful, with decent tutorials available. Free. Includes most numerical routines you might want, and since it's based on C++ it can work with other libraries or code as well.
gnuplot: a free plotting package with some basic fitting capability (although not enough to do every HW problem). This might be a good option if you're writing standalone code in C/C++/FORTRAN and just need a way of plotting the output.
Mathematica: an integrated plotting and mathematical analysis package. Quite expensive (prohibitively so if you're not a student). A single user license for an old, slow version of Mathematica 4 is available on the main physics server.
GNU Scientific Library: a free library of computational routines. To some extent it is a freeware equivalent of the routines in Numerical Recipes. Available in C and C++.
Numerical Recipes: Very commonly used. Although the text of the book is available online for free, the routines are proprietary, and you're supposed to buy the book if you use any of the routines. Available in C, C++, and FORTRAN.
Programming languages: It's up to you to choose what programming language you feel most comfortable with. These days I'd generally recommend C++, as it is quickly coming to dominate many areas of the physical sciences, and most libraries of scientific routines are available in that language. But I will confess that I am personally still much more fluent in FORTRAN and C. If you want to use something besides C++, C, or FORTRAN, please be my guest, but I won't be able to offer you much specific advice on coding issues. Not that I will anyway.
Missed exams: There will be one in-class midterm exam. If you miss the exam with a legitimate excuse (proof of illness, family emergency, etc.), see me to discuss make-up options.
Religious holidays: Students are entitled to request an alternate test date if a scheduled test date falls on one of their holy days. If you think this may apply to you, please contact me as soon as possible to make an alternate arrangement. Please don't put this off until the last minute: you must give at least two weeks' notice.
FINAL EXAM: The final exam is not yet scheduled. A take-home exam is a likely possibility.
Syllabus: A tentative lecture schedule follows. It will almost certainly be adjusted as the course proceeds. In the reading material, B refers to Barlow's textbook and G to Gregory's.
Date | Topics Covered | Reading Material
9/6 | First day of class. Introduction; interpretations of probability | B7.1; G1.1-1.4
9/8 | NO CLASS: TA training day |
9/13 | Basic descriptive statistics; random variables; Gaussian and binomial distributions | B2.1-2.6; B3.1-3.2
9/15 | Poisson, exponential, and chi^2 distributions; mathematics of manipulating probability distributions | B3.3-3.5
9/20 | Monte Carlo and basic computational methods: random number generation, minimization routines, coding hints | B10.1-10.4
9/22 | Intro to Bayesian analysis: general principles, basic applications, contrast with frequentist approach, nuisance parameters and systematic uncertainties | G Ch 3-4
9/27 | NO CLASS: INSTRUCTOR TO BE ABDUCTED BY ALIENS |
9/29 | NO CLASS: INSTRUCTOR TO BE ABDUCTED BY ALIENS |
10/4 | Bayesian analysis: choice of priors, maximum entropy principles | G Ch 4, 8
10/6 | The central limit theorem; the Chebyshev limit; covariance matrices and multidimensional Gaussian distributions | B4.1-4.4
10/11 | Estimators I: introduction & maximum likelihood method | B5.1-5.4
10/13 | Estimators II: least squares methods | B5.5-5.6, B6.1-6.7
10/18 | Error propagation methods: meaning and interpretation of error bars; the error propagation equation; dealing with correlations; handling asymmetric and non-Gaussian errors | B4.1-4.4
10/20 | IN-CLASS MIDTERM EXAM |
10/25 | Systematic Uncertainties I: distinction or lack thereof between statistical and systematic uncertainties; Monte Carlo evaluation; covariance matrix approach | B4.1-4.4
10/27 | Systematic Uncertainties II: the pull method/"floating systematics"; how to evaluate systematics; common mistakes in systematic error propagation | B4.1-4.4
11/01 | Hypothesis/significance testing I: introduction, interpretation, significance and power, Neyman-Pearson lemma; trials factors | B8.1-8.2.2
11/03 | Hypothesis/significance testing II: likelihood ratio test, goodness of fit, Kolmogorov-Smirnov tests, the two-sample problem and the t-test | B8.2.3-8.4
11/08 | DISCUSSION DAY |
11/10 | Bayesian analysis: numerical methods (Laplace's approximation, methods of marginalizing over nuisance parameters, numerical integration, Markov Chain Monte Carlo and the Metropolis-Hastings algorithm) | G11-12
11/15 | Confidence regions: Bayesian and frequentist interpretations; nonphysical regions; Feldman-Cousins confidence intervals | B7.2, this paper
11/17 | Multivariate analysis: linear Fisher discriminants; likelihood ratio approximations; decision trees; machine learning | class notes
11/22 | DISCUSSION DAY. Also, please read the attached notes and paper on blind analyses. |
11/24 | Nonparametric tests: sign test for the median; the Mann-Whitney test; matched pairs; Spearman's correlation coefficient; run tests | B8.3.2-8.3.3, B9.1-9.3
11/29 | Robust methods of parameter estimation; bootstrap method | Numerical Recipes 15.7; class notes
12/01 | Deconvolution and unfolding (LAST DAY OF CLASS) | class notes; see also supplemental text Cowan, Ch 11
Bonus notes: Periodicity studies. Reading: G Appendix B, G Ch 13, + this paper
Scott Oser, June 22, 2011