English with Korean: statistics

Thursday, 18 April 2013

Probability : Binomial

Today's concept

Binomial

In probability theory and statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution is a good approximation, and is widely used.

source: http://en.wikipedia.org/wiki/Binomial_distribution

Example (1): Flip 2 coins

Let Y = total number of heads

P(Y=0)=1/4
P(Y=1)=1/2 (<- together, these are called the probability distribution.)
P(Y=2)=1/4

Binomial(2, 1/2): Interpretation - there are 2 trials here because the coin is flipped twice, and the probability of heads on each flip is one half (1/2).

Example (2): Flip 1 coin

Let Y = total number of heads

P(Y=0)=1/2
P(Y=1)=1/2

Binomial(1, 1/2) => a special case: a single trial with a 50/50 chance. (Bernoulli distribution)

Example (3): Flip 3 coins (three times)

Let Y = total number of heads

P(Y=0)=1/8
P(Y=1)=3/8
P(Y=2)=3/8
P(Y=3)=1/8

Binomial(3, 1/2)
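The coin-flip distributions above can be double-checked by brute force. This is just a small sketch, assuming a fair coin: it lists every equally likely sequence of heads and tails and counts the heads in each one. The function name `heads_distribution` is made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(num_flips):
    """Return {y: P(Y = y)} where Y is the number of heads in num_flips fair flips."""
    # Every sequence of H/T is equally likely for a fair coin.
    outcomes = list(product("HT", repeat=num_flips))
    # Count how many sequences contain 0, 1, 2, ... heads.
    counts = Counter(sequence.count("H") for sequence in outcomes)
    return {y: Fraction(c, len(outcomes)) for y, c in sorted(counts.items())}


for n in (1, 2, 3):
    distribution = heads_distribution(n)
    print(f"Binomial({n}, 1/2):", {y: str(p) for y, p in distribution.items()})
```

Fractions are used instead of floats so the results print exactly as 1/8, 3/8 and so on, matching the hand-written values.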

** Advanced concept**

Binomial Probability Formula

A probability formula for Bernoulli trials. The probability of achieving exactly k successes in n trials is shown below.

Formula: P(k successes in n trials) = C(n, k) * p^k * q^(n – k), where C(n, k) = n! / (k! (n – k)!)
n = number of trials
k = number of successes
n – k = number of failures
p = probability of success in one trial
q = 1 – p = probability of failure in one trial

Example: You are taking a 10-question multiple-choice test. If each question has four choices and you guess on each question, what is the probability of getting exactly 7 questions correct?
n = 10
k = 7
n – k = 3
p = 0.25 = probability of guessing the correct answer on a question
q = 0.75 = probability of guessing the wrong answer on a question

source: http://www.mathwords.com/b/binomial_probability_formula.htm
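To see the numbers for the multiple-choice example, here is a minimal sketch that evaluates the standard binomial formula P(k) = C(n, k) * p^k * q^(n - k) with Python's `math.comb`. The helper name `binomial_pmf` is my own choice for illustration, not something from the sources above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    q = 1 - p  # probability of failure in one trial
    return comb(n, k) * p**k * q ** (n - k)


# Guessing on a 10-question test with 4 choices per question: n = 10, k = 7, p = 0.25
prob_seven_correct = binomial_pmf(7, 10, 0.25)
print(f"P(exactly 7 correct) = {prob_seven_correct:.4f}")  # 0.0031

# The same function reproduces the three-coin example: 1/8, 3/8, 3/8, 1/8
three_coins = [binomial_pmf(k, 3, 0.5) for k in range(4)]
print(three_coins)
```

So guessing 7 out of 10 correctly happens only about 3 times in 1000 attempts.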

Thursday, 11 April 2013

Statistics: Making sense of data : scatter plot & correlations

Scatter plot

Statisticians and quality control technicians gather data to determine correlations (relationships) between events. Scatter plots will often show at a glance whether a relationship exists between two sets of data (in this lecture, two quantitative variables).

Let's decide if studying longer will affect Regents grades based upon a specific set of data. Given the data below, a scatter plot has been prepared to represent the data. Remember when making a scatter plot, do NOT connect the dots.

Study Hours	Regents Score
3	80
5	90
2	75
6	80
7	90
1	50
2	65
7	85
1	40
7	100

Notice: Certain values may have more than one result,
such as (7,90) and (7,85) and (7,100).

The data displayed on the graph resembles a line rising from left to right. Since the slope of the line is positive, there is a positive correlation between the two sets of data. This means that according to this set of data, the longer I study, the better grade I will get on my Regents examination.

Note: Just because this set of data showed a positive correlation does not mean that the relationship is positive for all sets of data concerning study time and Regents scores. There may be sets of data that show that there is NOT a positive correlation between hours studying and better Regents scores.
It all depends on the data being examined.

If the line had been falling from left to right, its slope would have been negative and a negative correlation would exist. Under a negative correlation, the longer I study, the worse grade I would get on my Regents examination. YEEK!!

If the plot on the graph is scattered in such a way that it does not approximate a line (it does not appear to rise or fall), there is no correlation between the sets of data. No correlation means that the data just doesn't show whether studying longer has any effect on Regents examination scores.
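The direction of the relationship can also be checked with a number instead of by eye. The sketch below computes the Pearson correlation coefficient r for the study-hours table. Note that r is not mentioned in the source page; I am adding it here as one common way to measure how closely points follow a rising (r near +1) or falling (r near -1) line. The function name `pearson_r` is just for this example.

```python
from math import sqrt

# Data copied from the Study Hours / Regents Score table
study_hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
regents_scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(study_hours, regents_scores)
print(f"r = {r:.2f}")  # r = 0.85
```

A value around 0.85 is positive and fairly close to 1, which agrees with the rising pattern described for this data set.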

Check out these graphs for visual interpretations of types of correlations:

The points are clustered as to resemble a rising straight line with a positive slope.

While the points "tend" to be rising, it is not a clearly positive relationship since points are not clustered as to show a clear straight line.

The points are clustered as to resemble a falling straight line with a negative slope.

While the points "tend" to be falling, it is not a clearly negative relationship since points are not clustered as to show a clear straight line.

There is no way of determining from these points whether the pattern is rising or falling. There is no evidence of a straight line.

Warning!!

Correlation does not necessarily mean Causation.
Just because there is a strong correlation between two sets of data does not necessarily mean that one set is causing the effect that is occurring in the other.

During the months of February and March, the weekly number of jars of strawberry jam sold at a local market in New York was recorded. For the same time frame, the number of copies of a popular classical music CD sold in Florida was recorded. The data was examined and plotted.

From looking at the graph, it can be seen that there is a high positive correlation between these two sets of data.

So, this must mean that the number of jars of strawberry jam sold in New York was causing an increase in the number of classical music CDs sold in Florida. Of course this is not true!

Always be careful what you infer from your statistical analyses. Be sure the relationship makes sense. Also keep in mind that other factors may be involved in a cause-effect relationship.

source: http://www.regentsprep.org/Regents/math/ALGEBRA/AD4/scatter.htm

Weekly Data Collection
Jars of strawberry jam sold in New York	CDs sold in Florida
5	25
7	30
9	35
10	42
11	48
11	52
12	56

Statistics: Making sense of data : 2.2 Examining Relationships Between Two Categorical Variables

Examining Relationships Between Two Categorical Variables

1. Distribution = the pattern of values in the data for that variable, showing the frequency of occurrence of the values relative to each other

2. Distribution of a categorical variable = the frequencies or relative frequencies of the observations in each category of the variable.

3. Joint distribution = the frequency or relative frequency of the observations for the two variables considered together as a combination.
= Contingency table or Cross-tabulation or Two-way table

4. Marginal distribution = the distribution of only one of the variables in a contingency table.
=> we can say the totals column (or totals row) of the table is the marginal distribution.

5. Conditional distribution = the distribution of a categorical variable within a fixed value of a second variable.
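To make the joint, marginal and conditional distributions concrete, here is a small sketch. The data is completely made up for illustration (two groups, A and B, answering yes or no); it does not come from the lecture, and all the variable names are my own.

```python
from collections import Counter

# HYPOTHETICAL observations, invented for this example: (group, answer)
observations = [
    ("A", "yes"), ("A", "no"), ("A", "yes"), ("A", "no"),
    ("B", "no"), ("B", "yes"), ("B", "no"), ("B", "no"),
]
total = len(observations)

# Joint distribution: relative frequency of each (group, answer) combination,
# i.e. the cells of the contingency table divided by the grand total.
joint_counts = Counter(observations)
joint = {pair: count / total for pair, count in joint_counts.items()}

# Marginal distribution of group: the totals column of the table.
group_totals = Counter(group for group, _ in observations)
marginal_group = {group: count / total for group, count in group_totals.items()}


def conditional_answer_given(group):
    """Distribution of answer within one fixed value of group."""
    answers = sorted({answer for _, answer in observations})
    return {a: joint_counts[(group, a)] / group_totals[group] for a in answers}


print("joint:", joint)
print("marginal (group):", marginal_group)
print("answer | group = B:", conditional_answer_given("B"))
```

The conditional distribution divides by the group's own total rather than the grand total, which is the only difference from the joint distribution.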

Statistics: Making sense of Data : Box plot 2

In the previous lecture, we learned a simplified box plot. Today, we're going to go through it in detail, and after that we will be able to represent any dataset more precisely. A few concepts from statistics make this easier to get used to.

1. IQR (Interquartile range)
: It represents the range from the first quartile (25% of the dataset, counting from the minimum) to the third quartile (75% of the dataset).
: IQR = 3rd Q - 1st Q

2. Lower inner fence / Upper inner fence
: Used to set outliers apart so that the figure represents the rest of the data more precisely
: L.I = 1st Q - 1.5*IQR
: U.I = 3rd Q + 1.5*IQR

3. Mean
: It is the average value of the dataset.
: Add up all the values from x_1 to x_n, where n is the number of data points, and divide by n
: Total value of data / The number of data points
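The three concepts above can be tried out in a few lines. This is only a sketch: it reuses the Regents scores from the scatter plot post as sample data, and it assumes the "inclusive" quartile method of Python's `statistics.quantiles`. Other textbooks and tools use slightly different quartile conventions, so the quartiles may differ a little elsewhere.

```python
from statistics import mean, quantiles

# Sample data: the Regents scores from the scatter plot post
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

# quantiles() sorts the data itself and returns the three cut points for n=4
q1, median, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1                 # 1. IQR = 3rd Q - 1st Q
lower_fence = q1 - 1.5 * iqr  # 2. lower inner fence
upper_fence = q3 + 1.5 * iqr  #    upper inner fence
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
average = mean(scores)        # 3. mean = total / number of data points

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {lower_fence} to {upper_fence}, outliers: {outliers}")
print(f"mean = {average}")
```

With this data every score falls inside the fences, so no point would be drawn as an outlier.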

Monday, 1 April 2013

Statistics: Making sense of Data : Box plot

Starting today, I am studying statistics through coursera.org. I am pretty sure many people already know about the website. It offers a variety of university courses to the public for free. You can choose any subject you are interested in or are already studying in class. It is very useful and helpful for people who want to take college-level courses from prominent universities throughout the world.

My first choice among these courses is 'Statistics: Making sense of Data'

To get back to my story, the first tool I learned today was the 'box plot'. The chart looked familiar because I often saw this type of graph at work, but I had no idea what it meant. Now I finally understand the figure. A box plot represents 5 specific points of a data set, which are

1. maximum

2. the third quartile (75%)

3. median

4. the first quartile (25%)

5. minimum
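These five points can be computed directly. Below is a rough sketch with a tiny made-up data set (the numbers are hypothetical, chosen only so the answer is easy to check by hand). The function name `five_point_summary` is my own, and the quartiles assume the "inclusive" method of `statistics.quantiles`; other quartile conventions exist.

```python
from statistics import quantiles


def five_point_summary(data):
    """Return the five box plot points, listed in the same order as above."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "maximum": max(data),
        "third quartile": q3,
        "median": median,
        "first quartile": q1,
        "minimum": min(data),
    }


# HYPOTHETICAL data, invented for this example
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
for name, value in five_point_summary(sample).items():
    print(f"{name}: {value}")
```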

I found two awesome blog posts which explain...

How to draw a 'boxplot' with Microsoft Excel
1. http://blog.naver.com/dev000?Redirect=Log&logNo=110045317043

How it is used in real life.
2. http://nelsontouchconsulting.wordpress.com/2011/01/07/behold-the-box-plot/
