% Script 2 for Week 5
% A Bit of Statistics

% Many processes in the world are aggregate phenomena built from many tiny contributions, each of which is random in some sense. Still, the aggregate (such as temperature or pressure) can be very predictable, and modeling to obtain this prediction will in many cases deal with that random component directly. Monte Carlo simulation techniques (Stanislaw Ulam, Los Alamos, 1946) can help us make predictions even when there is no randomness involved in the underlying situation.

% Example: Estimate pi using a pseudorandom number generator.
rng('shuffle')      % assign a time-based seed to the random number generator
L=rand(1000000,2);  % one million randomly selected points in the unit square
L(347,:)            % look at one of them
M=L(:,1).^2+L(:,2).^2;        % M is the square of the distance from each point to the origin
length(find(M<1))/length(M)   % should be close to the ratio of the area of the quarter circle to the area of the unit square
pi/4

% A 50-by-1 column vector of random integers between 0 and 100.
L=round(100*sort(rand(50,1)))
randn(7,1)          % normally distributed random numbers, for comparison

% Below is a list of 200 random integers drawn uniformly from the integers 7 through 98.
L=randi([7 98],200,1)
find(L==30)
L=sort(L)
mean(L)
median(L)
mode(L)             % if L is multimodal, this gives only the smallest of the most frequent values
mode(L(1:40))
find(L==mode(L))
find(L==mode(L(1:40)))
L(find(L==mode(L)))
std(L)
var(L)

% Ignore the next few lines if you don't do much with statistics. Here I recreate ``by hand'' some functions that are built into MATLAB for the normal distribution (as well as for many other distributions). The built-in functions are MUCH faster, and we may revisit them if student interest in statistics, and time, allow.
Z=(L-mean(L))./std(L); % Z-scores: standardize L so it has mean 0 and standard deviation 1, for ease of comparison to a normal distribution

% The normal distribution, the ubiquitous ``bell curve'', has some interesting properties. The bell curve function with mean m and standard deviation sd is given below as norpdf. One of the first things people learn about it is the 68-95-99.7 percent rule: on a normal distribution these percentages are the approximate fractions of the area under the curve within 1, 2, or 3 standard deviations of the mean, respectively. Interpretation: the next time a number is ``selected'' according to this distribution, there is a 68% chance it will be within 1 standard deviation of the mean. If these percentages are not close to correct for your distribution, or if the distribution is not symmetric, then your distribution is NOT (close to) normal.

% (Note: the following three functions are built into MATLAB for the normal distribution, along with their analogues for many other distributions. We construct them ``by hand'' here.)
norpdf=@(x,m,sd) 1/(sd*sqrt(2*pi))*exp(-(x-m).^2/(2*sd^2))   % normal probability density function
intnormpdf=@(a,b,m,sd) integral(@(x)norpdf(x,m,sd),a,b)      % area under the curve between a and b
invpdf=@(p) fminsearch(@(x)(intnormpdf(-20,x,0,1)-p).^2,0)   % the z with fraction p of the standard normal area to its left
for i=2:(length(Z)-1)
    plotnorm(i-1)=invpdf(i/length(Z));
end
plot(Z(2:(length(Z)-1)),plotnorm,[-2;2],[-2;2])
% On a normal distribution a certain fraction of the data should lie ``to the left'' of each point. If this line is nearly straight, that fraction is nearly correct at each point in the data, a good test for normality.
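% (A quick check of the 68-95-99.7 rule using intnormpdf defined above; the three results should come out close to 0.68, 0.95, and 0.997.)
intnormpdf(-1,1,0,1)
intnormpdf(-2,2,0,1)
intnormpdf(-3,3,0,1)
% (For reference, and assuming the Statistics and Machine Learning Toolbox is installed, which this script does not require: the built-in counterparts of the hand-made functions above are normpdf, normcdf, and norminv, and normplot(Z) or qqplot(Z) draws a normal probability plot much like the one produced above. For example, uncomment the lines below if you have the toolbox.)
% normcdf(1)-normcdf(-1)   % about 0.68
% norminv(0.975)           % about 1.96
% normplot(Z)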
% You can also see whether the data are nearly normal by looking at a histogram. Does it ``look'' like a bell curve?
histogram(L)
histogram(L,[0,10,30,80,100])    % specify the bin edges
histogram(L,7)                   % specify the number of bins
histogram(L,min(L):max(L))
histogram(L,min(L):2:max(L))
fivenum(L)                       % five-number summary: min, lower quartile, median, upper quartile, max
J=[-10 0 4 5 6 7 5 4 5 6 12 15]  % a small data set entered by hand, with potential outliers
bplot(L,'nopoints')
bplot(L,'horizontal')
bplot(L,'horizontal','std','nodots')
bplot(J,'outliers')
fourplot(L)
fourplot(J)
dotplot(L)
dotplot(J)
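% (An aside: fivenum, bplot, fourplot, and dotplot are not part of base MATLAB; they appear to be course-supplied or File Exchange functions. If they are not on your path, the sketch below gets similar information from standard tools: prctile and boxplot assume the Statistics and Machine Learning Toolbox, and boxchart needs base MATLAB R2020a or later.)
[min(L) prctile(L,25) median(L) prctile(L,75) max(L)]   % an approximate five-number summary for L
boxchart(L)                                             % box-and-whisker plot of L
boxplot(J)                                              % marks outliers, much as bplot(J,'outliers') does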