What are survival analyses and what is the Kaplan–Meier estimate?

Survival analyses

The survival function, also known as a survivor function or reliability function, is a property of any random variable that maps a set of events, usually associated with mortality or failure of some system, onto time. It captures the probability that the system will survive beyond a specified time.

The term reliability function is common in engineering while the term survival function is used in a broader range of applications, including human mortality. Another name for the survival function is the complementary cumulative distribution function.

Definition

Let T be a continuous random variable with cumulative distribution function F(t) on the interval [0,∞). Its survival function or reliability function is:

S(t) = P(T > t) = \int_t^{\infty} f(u)\,du = 1 - F(t).
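
As a quick illustration (a minimal sketch assuming SciPy is available), every distribution in scipy.stats exposes the survival function directly as sf, which is exactly 1 - F(t):

    import numpy as np
    from scipy.stats import expon

    # Exponential lifetime with mean 10 (for expon, scale = mean)
    t = np.linspace(0, 50, 6)

    S = expon.sf(t, scale=10)    # survival function S(t) = P(T > t)
    F = expon.cdf(t, scale=10)   # cumulative distribution function F(t)

    assert np.allclose(S, 1 - F)  # S(t) = 1 - F(t)
    print(S)                      # monotonically decreasing, S(0) = 1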

Properties

Every survival function S(t) is monotonically decreasing, i.e. S(u) \le S(t) for all u > t.

The time t = 0 represents some origin, typically the beginning of a study or the start of operation of some system. S(0) is commonly unity, but it can be less to represent the probability that the system fails immediately upon operation.

Since the CDF F(t) is a right-continuous function and S(t) = 1 - F(t), the survival function is also right-continuous.

Kaplan–Meier estimates

The Kaplan–Meier estimator,[1][2] also known as the product-limit estimator, is a non-parametric estimator of the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In economics, it can be used to measure the length of time people remain unemployed after a job loss. In engineering, it can be used to measure the time until failure of machine parts. In ecology, it can be used to estimate how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier.
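
Concretely, the estimator is the product over the distinct observed event times t_i of the conditional survival probabilities:

\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right),

where d_i is the number of events at t_i and n_i is the number of subjects still at risk just before t_i. Here is a minimal NumPy sketch with toy data (the numbers are made up for illustration):

    import numpy as np

    def kaplan_meier(durations, observed):
        """Product-limit estimate of the survival curve from right-censored data."""
        durations = np.asarray(durations, dtype=float)
        observed = np.asarray(observed, dtype=bool)

        event_times = np.unique(durations[observed])  # distinct event times t_i
        n_i = np.array([(durations >= t).sum() for t in event_times])               # at risk just before t_i
        d_i = np.array([((durations == t) & observed).sum() for t in event_times])  # events at t_i

        return event_times, np.cumprod(1.0 - d_i / n_i)  # S-hat at each t_i

    # Toy data: follow-up times; 1 = event observed, 0 = censored
    times, surv = kaplan_meier([3, 5, 5, 8, 10, 12], [1, 1, 0, 1, 0, 1])
    print(dict(zip(times, surv)))

Censored subjects (observed = 0) never contribute an event, but they do count toward n_i for every event time up to their censoring time – that is the whole point of the estimator.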

I realize the source is mostly Wikipedia, but I created this post for my personal needs; hopefully I will turn it into a proper post on my blog in the near future.

What can be useful in measuring power-law-type distributions

Because this is about ongoing research, I cannot reveal the exact use case for these statistical methods, but let me explain simply: we have an important dimension in the dataset that can be characterized as a power-law distribution (a long tail on the right and a sharp peak at the left). Because I am still learning statistics, and this material can be useful more than once, I want to write down what I learned.

Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another—the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence. The investigator also typically assesses the “statistical significance” of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.
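
To make this concrete, here is a minimal sketch with hypothetical price/demand data (assuming the statsmodels package; all numbers are invented):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    price = rng.uniform(1, 10, 100)                  # hypothetical causal variable
    demand = 50 - 3 * price + rng.normal(0, 2, 100)  # hypothetical response + noise

    X = sm.add_constant(price)     # add the intercept column
    fit = sm.OLS(demand, X).fit()  # ordinary least squares

    print(fit.params)   # estimated intercept and slope
    print(fit.pvalues)  # p-values: the "statistical significance" of each estimate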

Double logarithmic (log–log) transformation – taking the logarithm of both the dependent and the independent variable, so that a power law y = c x^k becomes the straight line \log y = \log c + k \log x. You can read here: http://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va – to find out when it is appropriate to use the logarithm of a variable instead of its actual values.
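
This is what makes a linear regression fit applicable to power-law data: a least-squares line in log–log space recovers the exponent. A minimal NumPy sketch with synthetic data (the exponent -1.7 and constant 2.5 are assumptions chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(1, 1000, 500)
    y = 2.5 * x ** -1.7 * np.exp(rng.normal(0, 0.1, x.size))  # noisy power law, k = -1.7

    # Linear fit in log-log space: log y = log c + k * log x
    k, log_c = np.polyfit(np.log(x), np.log(y), 1)
    print(k, np.exp(log_c))  # should be close to -1.7 and 2.5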

Normal distribution – the normal distribution is immensely useful because of the central limit theorem, which states that, under mild conditions, the mean of many random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution: physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have a distribution very close to the normal. Moreover, many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed.
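
The theorem is easy to check empirically. A minimal sketch: averages of samples drawn from the heavily skewed exponential distribution already look approximately normal (the sample size of 50 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    draws = rng.exponential(scale=1.0, size=(10_000, 50))  # 10,000 samples of size 50

    means = draws.mean(axis=1)  # one mean per sample

    # The exponential is heavily skewed, yet the sample means are nearly symmetric:
    print(means.mean(), means.std())  # approx. 1 and 1/sqrt(50) = 0.141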

Linear regression fit – fitting a straight line to the data by ordinary least squares, as in the sketches above.

Nonlinear LOESS regression fit – loess stands for locally estimated scatter-plot smoothing (lowess stands for locally weighted scatter-plot smoothing) and is one of many non-parametric regression techniques, but arguably the most flexible.

http://n-steps.tetratech-ffx.com/PDF&otherFiles/stat_anal_tools/LOESS_final.pdf
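
In Python, a LOWESS fit is available in statsmodels (a minimal sketch; the data and the frac smoothing parameter are assumptions, not recommendations):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(0, 0.3, x.size)  # nonlinear signal + noise

    # frac = fraction of the data used for each local fit; larger = smoother
    smoothed = sm.nonparametric.lowess(y, x, frac=0.25)  # returns sorted (x, y-hat) pairs

    print(smoothed[:5])  # first few (x, fitted value) rows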

Do you know the difference between homoscedasticity and heteroscedasticity?

Homoscedasticity versus heteroscedasticity

[Figures: plots of random data showing homoscedasticity vs. heteroscedasticity]

Homoscedasticity can also be called homogeneity of variance, because it describes a situation in which a sequence or vector of random variables all have the same finite variance. And as we probably know already, variance measures how far a set of numbers is spread out. The complementary notion is called heteroscedasticity. To sum up:

  • In statistics, a sequence or a vector of random variables is homoscedastic /ˌhoʊmoʊskəˈdæstɪk/ if all random variables in the sequence or vector have the same finite variance.
  • A collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others.
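
A minimal simulation of the two situations (all numbers assumed): in the first series the noise has constant variance, in the second it grows with x, which shows up as a growing spread of residuals.

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(1, 10, 1000)

    homo = 2 * x + rng.normal(0, 1.0, x.size)  # constant variance around the trend
    hetero = 2 * x + rng.normal(0, 0.5 * x)    # variance grows with x

    # Compare residual spread in the lower and upper halves of x:
    for name, y in [("homoscedastic", homo), ("heteroscedastic", hetero)]:
        resid = y - 2 * x
        print(name, resid[:500].std(), resid[500:].std())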