Standardization
- A collection of data can be converted to standardized, or unitless, form.
- This is performed by subtracting the mean of the data set from each observation and dividing by the standard deviation.
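As a minimal sketch of this computation (assuming NumPy; the function name `standardize` and the use of the sample standard deviation are our choices, not the slides'):

```python
import numpy as np

def standardize(x):
    # Subtract the mean of the data set and divide by the standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # ddof=1: sample sd (our assumption)

# Example: measurements in millimetres become unitless z-scores.
lengths_mm = [12.0, 15.5, 9.8, 20.1, 14.2]
print(standardize(lengths_mm))  # values now have mean 0 and sd 1
```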
CLUSTER ANALYSIS
- First used by Tryon (1939).
- Cluster analysis methods are mostly used when we do not have any a priori hypotheses.
- The purpose is to classify complex multivariate data through some means of reducing their dimensionality.
- It actually encompasses a number of different classification algorithms.
- In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.
Joining (Tree Clustering)
- The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance.
- A typical result of this type of clustering is the hierarchical tree (called a dendrogram).
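As an illustrative sketch (our construction, not from the slides), SciPy can perform this joining and draw the resulting dendrogram; the data here are hypothetical random measurements:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data: 6 objects (e.g., animals) measured on 3 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))

# Join objects into successively larger clusters using Euclidean
# distance and average linkage; Z encodes the hierarchical tree.
Z = linkage(data, method="average", metric="euclidean")

dendrogram(Z, labels=[f"obj{i}" for i in range(6)])
plt.title("Hierarchical tree (dendrogram)")
plt.show()
```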
Distance Measures
- The joining or tree clustering method uses the similarities/dissimilarities or distances between objects when forming the clusters. These distances can be based on a single dimension or on multiple dimensions.
- Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space. It is computed as: distance(x, y) = {Σᵢ (xᵢ − yᵢ)²}^½
- Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as: distance(x, y) = Σᵢ (xᵢ − yᵢ)²
- Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as: distance(x, y) = maxᵢ |xᵢ − yᵢ|
- Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as: distance(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/r), where r and p are user-defined parameters.
- Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as: distance(x, y) = (number of xᵢ ≠ yᵢ) / i, where i is the total number of dimensions.
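A sketch of these distance measures in NumPy (the vector values and the categorical example are hypothetical):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 3.0, 9.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))      # {Σᵢ (xᵢ − yᵢ)²}^½
squared_euclidean = np.sum((x - y) ** 2)       # Σᵢ (xᵢ − yᵢ)²
chebychev = np.max(np.abs(x - y))              # maxᵢ |xᵢ − yᵢ|

p, r = 3, 2                                    # user-defined parameters
power = np.sum(np.abs(x - y) ** p) ** (1 / r)  # (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/r)

# Percent disagreement, for categorical data.
a = np.array(["sand", "silt", "clay"])
b = np.array(["sand", "clay", "clay"])
percent_disagreement = np.mean(a != b)         # (number of xᵢ ≠ yᵢ) / i
```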
Similarity measures for presence/absence data
1. Jaccard similarity. A match is counted for all taxa with presences in both samples. Using M for the number of matches and N for the total number of taxa with presences in just one sample, we have:
Jaccard similarity = M / (M+N)
2. Dice (Sorensen) coefficient. Puts more weight on joint occurrences (M) than on mismatches.
Dice similarity = 2M / (2M+N)
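A sketch of both coefficients for presence/absence vectors (the sample data are hypothetical; NumPy assumed):

```python
import numpy as np

def jaccard(s1, s2):
    # M: taxa present in both samples; N: taxa present in exactly one.
    s1, s2 = np.asarray(s1, bool), np.asarray(s2, bool)
    M = np.sum(s1 & s2)
    N = np.sum(s1 ^ s2)
    return M / (M + N)

def dice(s1, s2):
    # Joint occurrences (M) are counted twice, so they weigh more.
    s1, s2 = np.asarray(s1, bool), np.asarray(s2, bool)
    M = np.sum(s1 & s2)
    N = np.sum(s1 ^ s2)
    return 2 * M / (2 * M + N)

sample1 = [1, 1, 0, 1, 0]          # presence/absence of five taxa
sample2 = [1, 0, 0, 1, 1]
print(jaccard(sample1, sample2))   # M=2, N=2 -> 0.5
print(dice(sample1, sample2))      # 2*2 / (2*2 + 2) ≈ 0.667
```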
Similarity measures for abundance data
1. The Euclidean distance (see above).
2. Correlation (of the variables along rows) using Pearson's r.
3. Correlation using Spearman's rho (i.e., r computed on the ranks of the values).
4. Bray-Curtis distance measure, sensitive to absolute abundances.
5. Chord distance for abundance data. This index is
sensitive to species proportions and not to absolute
abundances. It projects the two multivariate sample
vectors onto a hypersphere and measures the distance
between these points, thus normalizing abundances to 1.
6. Morisita’s similarity index for abundance data.
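The slides do not spell out the formulas for measures 4 and 5; as a hedged sketch, here are common definitions of the Bray-Curtis and chord distances (Morisita's index is omitted):

```python
import numpy as np

def bray_curtis(x, y):
    # Common definition: Σ|xᵢ − yᵢ| / Σ(xᵢ + yᵢ); reacts to absolute abundances.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y)) / np.sum(x + y)

def chord(x, y):
    # Project both sample vectors onto the unit hypersphere (normalize to
    # length 1), then take the Euclidean distance: only proportions matter.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.linalg.norm(x / np.linalg.norm(x) - y / np.linalg.norm(y))

counts1 = np.array([10, 0, 5, 35])    # hypothetical abundance counts
counts2 = counts1 * 2                 # same proportions, double the counts
print(bray_curtis(counts1, counts2))  # 1/3: sensitive to absolute abundance
print(chord(counts1, counts2))        # 0.0: proportions are identical
```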
Amalgamation or Linkage Rules
- Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain"-type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages. It is also called mean linkage.
- Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.
- Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.
- Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.
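All four linkage rules above are available in SciPy under different names; a sketch (the random data are hypothetical, and the centroid-based methods assume Euclidean geometry):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
data = rng.normal(size=(8, 4))  # hypothetical: 8 objects, 4 variables

Z_upgma = linkage(data, method="average")   # unweighted pair-group average
Z_wpgma = linkage(data, method="weighted")  # weighted pair-group average
Z_upgmc = linkage(data, method="centroid")  # unweighted pair-group centroid
Z_wpgmc = linkage(data, method="median")    # weighted pair-group centroid (median)
```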
Two-way Joining (Block Clustering)
- In the other types of analyses, the research question of interest is usually expressed in terms of either cases (observations) or variables. Here we can cluster both observations and variables.
- Two-way joining is useful in the (relatively rare) circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.
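One rough way to realize this idea (our sketch, not a specific block-clustering algorithm from the slides) is to cluster rows and columns independently and reorder the data matrix so that similar cases and similar variables sit next to each other:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 6))    # hypothetical cases-by-variables matrix

row_order = leaves_list(linkage(X, method="average"))    # cluster the cases
col_order = leaves_list(linkage(X.T, method="average"))  # cluster the variables

# Reordered matrix: blocks of similar values become contiguous.
X_blocked = X[np.ix_(row_order, col_order)]
```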
K-means Clustering
- Computations
- Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse". The program starts with k random clusters, and then moves objects between those clusters with the goal of (1) minimizing variability within clusters and (2) maximizing variability between clusters. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
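A sketch of this procedure with scikit-learn's KMeans (an assumption; the slides do not name an implementation), using two hypothetical point clouds:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Hypothetical data: two loose clouds of 50 cases each in two dimensions.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Start with k random clusters, then iteratively move objects between
# them to minimize within-cluster and maximize between-cluster variability.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # cluster membership of the first five cases
print(km.cluster_centers_)   # the k cluster means
```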
- Interpretation of results
- Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.
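Continuing the hypothetical k-means sketch above, the per-dimension cluster means and F values could be inspected like this (scipy.stats.f_oneway performs the one-way ANOVA):

```python
from scipy.stats import f_oneway

# X and km come from the k-means sketch above.
for dim in range(X.shape[1]):
    groups = [X[km.labels_ == k, dim] for k in range(2)]
    means = [round(g.mean(), 2) for g in groups]
    F, p = f_oneway(*groups)  # large F: dimension separates the clusters well
    print(f"dimension {dim}: cluster means {means}, F = {F:.1f}, p = {p:.3g}")
```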