
STATISTICS IN QUANTITATIVE BIOSTRATIGRAPHY (b)

Statistics - the scientific study of numerical data based on natural phenomena. Data - individual observations or measurements taken on the smallest sampling unit; a data set includes multiple observations derived empirically (more than one datum). Numerical - the observations are quantifiable.


Standardization

  • A collection of data can be converted to standardized, or unitless, form.
  • Performed by subtracting from each observation the mean of the data set and dividing by the standard deviation.
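
A minimal sketch of the procedure in Python (the sample values here are made up for illustration):

    import numpy as np

    # Hypothetical measurements (e.g., shell lengths in mm); any 1-D data set works.
    data = np.array([12.1, 9.8, 15.3, 11.0, 13.7])

    # Standardize: subtract the mean from each observation, then divide by the
    # standard deviation. The result is unitless (z-scores).
    z = (data - data.mean()) / data.std()

    print(z)          # standardized values
    print(z.mean())   # ~0 by construction
    print(z.std())    # 1 by construction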

CLUSTER ANALYSIS

  • First used by Tryon (1939).
  • Cluster analysis methods are mostly used when we do not have any a priori hypotheses.
  • The purpose is to classify complex multivariate data through some means of reduction in its dimension.
  • It actually encompasses a number of different classification algorithms.
  • In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

Joining (Tree Clustering)

  • The purpose of this algorithm is to join objects (e.g., animals) together into successively larger clusters, using some measure of similarity or distance.
  • A typical result of this type of clustering is the hierarchical tree (called a dendrogram).
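
A minimal sketch of tree clustering with scipy, assuming a made-up objects-by-variables matrix; linkage joins the objects into successively larger clusters and dendrogram draws the resulting hierarchical tree:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Hypothetical data: 6 objects (rows) described by 4 variables (columns).
    rng = np.random.default_rng(0)
    X = rng.random((6, 4))

    # Join objects into a hierarchy using Euclidean distance and average linkage.
    Z = linkage(X, method="average", metric="euclidean")

    # The dendrogram summarizes the order in which objects were joined.
    dendrogram(Z, labels=[f"object {i}" for i in range(6)])
    plt.show()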

Joining (Tree Clustering)

Distance Measures

  • The joining or tree clustering method uses the similarities/dissimilarities or distances between objects when forming the clusters. These distances can be based on a single dimension or on multiple dimensions.
  • Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space, computed as:

distance(x, y) = { Σᵢ (xᵢ − yᵢ)² }^½

  • Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:

distance(x, y) = Σᵢ (xᵢ − yᵢ)²

Joining (Tree Clustering)

Distance Measures

  • Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they differ on any one of the dimensions. The Chebychev distance is computed as:

distance(x, y) = maxᵢ |xᵢ − yᵢ|

  • Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance, computed as:

distance(x, y) = ( Σᵢ |xᵢ − yᵢ|ᵖ )^(1/r), where r and p are user-defined parameters

  • Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

distance(x, y) = (number of xᵢ ≠ yᵢ) / i, where i is the total number of dimensions
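
These measures can be written directly from the formulas above; a minimal Python sketch (the example vectors are arbitrary):

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def squared_euclidean(x, y):
        return np.sum((x - y) ** 2)

    def chebychev(x, y):
        return np.max(np.abs(x - y))

    def power_distance(x, y, p, r):
        # p weights differences on individual dimensions, r weights larger
        # overall differences; r = p = 2 reduces to the Euclidean distance.
        return np.sum(np.abs(x - y) ** p) ** (1 / r)

    def percent_disagreement(x, y):
        return np.mean(x != y)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 2.0, 5.0])
    print(euclidean(x, y), chebychev(x, y), power_distance(x, y, p=2, r=2))

    # Percent disagreement suits categorical dimensions:
    a = np.array(["sand", "silt", "clay"])
    b = np.array(["sand", "clay", "clay"])
    print(percent_disagreement(a, b))  # 1 of 3 dimensions differ -> 0.333...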

Joining (Tree Clustering)

Distance Measures

Similarity measures for presence/absence data

1. Jaccard similarity. A match is counted for all taxa with presences in both samples. Using M for the number of matches and N for the total number of taxa with presences in just one sample, we have:

Jaccard similarity = M / (M + N)

2. Dice (Sorensen) coefficient. Puts more weight on joint occurrences (M) than on mismatches (N).

Dice similarity = 2M / (2M + N)
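
A minimal sketch of both indices, assuming boolean presence/absence vectors over the same set of taxa (the example vectors are made up; note that taxa absent from both samples count neither as matches nor as mismatches):

    import numpy as np

    def jaccard(a, b):
        M = np.sum(a & b)   # taxa present in both samples (matches)
        N = np.sum(a != b)  # taxa present in exactly one sample (mismatches)
        return M / (M + N)

    def dice(a, b):
        M = np.sum(a & b)
        N = np.sum(a != b)
        return 2 * M / (2 * M + N)  # joint occurrences weighted twice

    a = np.array([1, 1, 0, 1, 0], dtype=bool)
    b = np.array([1, 0, 0, 1, 1], dtype=bool)
    print(jaccard(a, b))  # M = 2, N = 2 -> 2 / 4 = 0.5
    print(dice(a, b))     # 4 / 6 ≈ 0.667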

Joining (Tree Clustering)

Distance Measures

Similarity measures for abundance data

1. Euclidean distance (see above).

2. Correlation (of the variables along rows) using Pearson's r.

3. Correlation using Spearman's rho (essentially the correlation of the ranks).

4. Bray-Curtis distance measure, sensitive to absolute abundances.

Joining (Tree Clustering)

Distance Measures

Similarity measures for abundance data

5. Chord distance for abundance data. This index is sensitive to species proportions and not to absolute abundances. It projects the two multivariate sample vectors onto a hypersphere and measures the distance between these points, thus normalizing abundances to 1.

6. Morisita's similarity index for abundance data.
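
A minimal sketch of the Bray-Curtis and chord distances (items 4 and 5); the slides do not spell out these formulas, so this follows their common definitions:

    import numpy as np

    def bray_curtis(x, y):
        # Sensitive to absolute abundances.
        return np.sum(np.abs(x - y)) / np.sum(x + y)

    def chord(x, y):
        # Project both abundance vectors onto the unit hypersphere, then take
        # the Euclidean distance; only the species proportions matter.
        return np.linalg.norm(x / np.linalg.norm(x) - y / np.linalg.norm(y))

    x = np.array([10.0, 0.0, 5.0])
    y = np.array([20.0, 0.0, 10.0])  # same proportions, doubled abundances
    print(bray_curtis(x, y))  # 1/3: responds to the doubling
    print(chord(x, y))        # 0.0: proportions are identical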

Joining (Tree Clustering)

Amalgamation or Linkage Rules

  • Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain"-type clusters. In their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA (unweighted pair-group method using arithmetic averages) for this method; it is also known as mean (average) linkage.
  • Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous one) should be used when the cluster sizes are suspected to be greatly uneven. Sneath and Sokal (1973) introduced the abbreviation WPGMA (weighted pair-group method using arithmetic averages) for this method.

Joining (Tree Clustering)

Amalgamation or Linkage Rules

  • Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions; in a sense, it is the center of gravity of the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC (unweighted pair-group method using the centroid average) for this method.
  • Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC (weighted pair-group method using the centroid average) for this method.
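
For reference, scipy's linkage function implements these four amalgamation rules under its own method names; a minimal sketch with made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(1)
    X = rng.random((8, 3))  # 8 objects described by 3 variables
    d = pdist(X)            # condensed Euclidean distance matrix

    Z_upgma = linkage(d, method="average")   # unweighted pair-group average
    Z_wpgma = linkage(d, method="weighted")  # weighted pair-group average
    # The centroid-based rules are only well defined for Euclidean distances,
    # so scipy expects the raw observations here.
    Z_upgmc = linkage(X, method="centroid")  # unweighted pair-group centroid
    Z_wpgmc = linkage(X, method="median")    # weighted pair-group centroid (median)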

Two-way Joining (Block Clustering)

  • In the other types of analyses, the research question of interest is usually expressed in terms of either cases (observations) or variables. Here we can cluster both observations and variables.
  • Two-way joining is useful in the (relatively rare) circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.
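
One readily available way to experiment with this idea is seaborn's clustermap, which clusters rows (cases) and columns (variables) simultaneously and reorders a heatmap accordingly; this is a closely related technique rather than necessarily the exact block-clustering algorithm meant here:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical table: 10 samples (rows) x 6 taxa (columns).
    rng = np.random.default_rng(2)
    data = pd.DataFrame(rng.random((10, 6)),
                        index=[f"sample {i}" for i in range(10)],
                        columns=[f"taxon {j}" for j in range(6)])

    # Rows and columns are each joined hierarchically; the heatmap is then
    # reordered so that blocks of similar values line up.
    sns.clustermap(data, method="average", metric="euclidean")
    plt.show()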


K-means Clustering

  • Computations
    • Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse". The program starts with k random clusters and then moves objects between those clusters with the goal of (1) minimizing variability within clusters and (2) maximizing variability between clusters. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
  • Interpretation of results
    • Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.
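
A minimal sketch with scikit-learn, using made-up data; the final loop mirrors the interpretation step above by comparing the cluster means on each dimension:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = rng.random((30, 4))  # 30 cases measured on 4 dimensions

    # Start from random clusters and move cases between them to minimize
    # within-cluster variability (and so maximize between-cluster variability).
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Interpretation: well-separated clusters should show very different means
    # on most, if not all, dimensions.
    for k in range(3):
        print(f"cluster {k} means:", X[km.labels_ == k].mean(axis=0))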