Statistics in medical research--I
SK Bowalekar
Keywords: Data Interpretation, Statistical, Female, Human, Infant, Newborn, Male, Research, statistics & numerical data, Statistics, methods
"What are the chances of my getting cured"?
"Is the new procedure/drug better than that routinely used"?
"If yes, by how much"?
These are some of the questions commonly faced by the common man, the practicing physician and even Drug Regulatory Authorities. Prima facie these questions appear extremely simple, and one could get away with superficial answers. However, a deeper look at these questions reveals their quantitative orientation. All these questions are statistical in nature and in order to provide sound and realistic answers, one has to rely on the results of statistical analysis.
With technological progress, the demand for quantitative rather than descriptive information is increasing rapidly in all fields and medicine is no exception to the rule. Research scholars in various fields including medicine and surgery present their findings and numerical evidence at conferences, seminars, discussions and through publications. Medical literature is thus flooded with statistical and numerical data.
In order to evaluate critically the numerical data and derive meaningful conclusions there is a definite need for a greater and proper understanding of statistical tools and techniques.
Statistics is perceived and defined differently by different people. Some people look at a statistician as a manipulator of numbers or a person busy in compiling tables of numbers. A sizable section of the community has started viewing statistics as a science helpful in decision making and in providing tools and techniques for meaningful interpretation of voluminous numerical data.
Formally, statistics is defined as a branch of science dealing with the designing of experiments and surveys leading to the collection of numerical facts, and the methods of presenting, analyzing, and interpreting these numerical facts.
Biometry or Biostatistics may be defined as the application of statistics as a science to biology/life sciences.
The subject matter of statistics can be broadly classified into three heads:
(i) Descriptive Statistics
(ii) Inferential Statistics and
(iii) Statistical decision theory.
Descriptive statistics deals with scientific methods of sampling, collection, condensation, presentation and analysis of numerical data. Inferential statistics covers the generalisation of results obtained from a sample to a population using an appropriate experimental design and relevant statistical tests. Statistical decision theory discusses the methods of probability and its applications in decision making under uncertainty.
In order to understand the various operational procedures in statistics, it is essential to know the basic concepts of statistics.
1. Population: Population is a well-defined set of objects about which the study is being carried out. "All hypertensive patients under the sun" comprises the population in a study conducted to examine smoking habits among hypertensives.
2. Sample: A sample is a subgroup of the population. It is a finite set of objects selected from the population in order to make observations on them. In the above example, the sample is the set of "hypertensive patients selected for a study" conducted to examine smoking habits among hypertensives.
3. Characteristics: An item on which the information is collected is termed a characteristic. Age, sex, number of cigarettes smoked etc are characteristics.
4. Attribute: A characteristic, which cannot be expressed numerically but can only be expressed qualitatively, is called an attribute. Sex, response, colour etc are examples of attributes. Sex cannot be expressed numerically but has categories like males and females. Response (good/bad) and colours (white/black) too, cannot be expressed numerically.
5. Variates (or) Variables: A characteristic, which can be expressed numerically, is called variate or variable. Weight, height, blood pressure etc are examples of variables.
5.1 Continuous variable: A variable which can take any value from a range, is called a continuous variable, eg height which can take any value between 4 ft and 6 ft is a continuous variable.
5.2 Discrete variable: A variable which takes only isolated values is called discrete variable. Number of boys in a class, number of patients responding to treatment are examples of discrete variables.
6. Parameter: Any measurable characteristic describing the population or universe is called a parameter. Thus a parameter is estimated using all observations from a population. The mean haemoglobin (Hb) level of all males from India is a parameter as it is estimated using Hb levels of each and every male from India.
7. Statistic: Any measurable characteristic describing the sample is called a statistic. Thus, statistic is estimated using observations only from a sample. For example, mean Hb of a few selected males (sample of males) is a statistic as it is computed using the observations from a sample.
To distinguish between the two, it is customary to denote the parameters with Greek letters and the statistics using Latin letters. Thus, the mean and standard deviation of the population are labelled as μ and σ respectively and those of the sample as x̄ and s respectively.
8. Degrees of freedom: Degrees of freedom (df) is defined as the number of values in a set of numbers that can vary independently, eg consider a set of 18 measurements of diastolic blood pressure levels. If the mean of these 18 observations is calculated, then the series will have (n-1), i.e. (18-1) or 17 df, because once 17 values are chosen the 18th value cannot be changed without disturbing the mean.
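The distinction between a parameter and a statistic, and the idea of degrees of freedom, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" below is simulated (10,000 hypothetical Hb values with an assumed mean of 14 g/dL), and the sample size of 18 is borrowed from the blood pressure example purely for convenience.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the illustration is repeatable

# Hypothetical population: simulated Hb levels (g/dL) of 10,000 males.
# These values are generated for illustration and are not real data.
population = [random.gauss(14.0, 1.2) for _ in range(10_000)]

# Parameters (Greek letters) use every observation in the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD, divides by N

# Statistics (Latin letters) use only the observations in a sample.
n = 18
sample = random.sample(population, n)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample SD, divides by n - 1

# Degrees of freedom: once the sample mean is fixed, the first n - 1
# values determine the last one, so only n - 1 values vary freely.
df = n - 1
last_value = n * x_bar - sum(sample[:-1])

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, df = {df}")
print(f"18th value recovered from the mean: {last_value:.4f}")
```

The recovered 18th value agrees with the value actually drawn, which is the sense in which it "cannot be changed without disturbing the mean".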
As mentioned earlier, descriptive statistics involves organizing, analysing and presenting statistical data. Statistical tools and techniques for analysing data are closely related to scales of measurement. Measurement may be defined as the assignment of numbers to characteristics of objects according to certain rules. Although a tremendous variety of measurement devices are available for gathering information, all measurements can be classified into one of the four basic scales of measurement; viz. nominal scale, ordinal scale, interval scale and ratio scale.
a) Nominal scale: A nominal scale is a scale in which observations are classified into mutually exclusive groups. Sex, occupation, blood group are examples of items measured on a nominal scale, as data on sex can be classified into two groups, males and females, while data on occupation can be grouped as business, service and so on.
b) Ordinal scale: Ordinal scale is an extension of the nominal scale. Besides all the information of a nominal scale, the ordinal scale also provides information about the ordered relationship among the different classes or groups. In medicine, there are many items that are measured on the ordinal scale. Severity of pain, response to therapy and improvement in the status are some characteristics that are measured on the ordinal scale. Here the severity of pain is generally classified as no pain, mild pain, moderate pain and severe pain and each class has some relationship with the other classes e.g. severe is the most harmful condition followed by moderate, mild and no pain. Thus, no pain is assigned a number 0, mild pain a number 1, moderate a number 2 and severe pain a number 3. In practice, severity of pain is also said to be measured on a four point scale.
c) Interval scale and ratio scale: Interval scale is a scale having a uniform or constant interval between consecutive marks. It differs from the ordinal scale in that the distance between consecutive marks is meaningful. For example, in severity of pain, which is measured on the ordinal scale, the interval between consecutive marks say mild=1 and moderate=2 cannot be asserted as equal to the interval between moderate=2 and severe=3.
Ratio scale is an extension of the interval scale. Interval scale does not include a true zero whereas the ratio scale includes a true zero. Weight, height, haemoglobin and blood pressure are some of the items measured on the ratio scale while temperature measured in Celsius or Fahrenheit is an example of the interval scale.
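A short Python sketch may help to show why the absence of a true zero matters. The temperatures and weights used are arbitrary illustrative values, and the conversion factor for pounds is the usual approximate one. On an interval scale, equal intervals remain equal after a change of units but ratios do not; on a ratio scale, ratios are preserved as well.

```python
import math

def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: three arbitrary temperatures, 10 degrees Celsius apart.
t_c = [10.0, 20.0, 30.0]
t_f = [celsius_to_fahrenheit(c) for c in t_c]  # 50.0, 68.0, 86.0

# Equal intervals stay equal in both units.
equal_intervals_c = (t_c[1] - t_c[0]) == (t_c[2] - t_c[1])
equal_intervals_f = (t_f[1] - t_f[0]) == (t_f[2] - t_f[1])

# Ratios are not preserved: "twice as hot" has no fixed meaning.
ratio_c = t_c[1] / t_c[0]  # 2.0
ratio_f = t_f[1] / t_f[0]  # 68 / 50 = 1.36

# Ratio scale: weight has a true zero, so ratios survive a unit change.
KG_TO_LB = 2.20462  # approximate conversion factor
w_kg = [40.0, 80.0]
w_lb = [w * KG_TO_LB for w in w_kg]
ratio_kg = w_kg[1] / w_kg[0]
ratio_lb = w_lb[1] / w_lb[0]

print(equal_intervals_c, equal_intervals_f)
print(ratio_c, ratio_f)
print(ratio_kg, ratio_lb)
```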
A long list of values of variables, not subjected to formal statistical processing forms raw data. It is not possible to draw any meaningful conclusion from raw data. Hence, it is necessary to present it in some easily interpretable format. The basic rule for displaying qualitative data is to count the number of observations in each category of variables and present the numbers in a table.
[Table - 1] giving the results of study on the efficacy of an antibiotic in 1000 patients is presented here as an illustration.
The numbers in the body of the table are called frequencies. Frequencies are the counts of observations in each group or class. The sum of frequencies in each column or each row makes up the total frequency or total number of observations.
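The counting rule for qualitative data can be sketched as follows. The eight records below are hypothetical and are not the data of [Table - 1]; the category labels "Cured" and "Not cured" are likewise assumed for illustration.

```python
from collections import Counter

# Hypothetical patient records as (sex, response) pairs.
records = [
    ("Male", "Cured"), ("Male", "Cured"), ("Male", "Not cured"),
    ("Female", "Cured"), ("Female", "Not cured"), ("Female", "Cured"),
    ("Male", "Cured"), ("Female", "Cured"),
]

# Frequencies: number of observations in each cell of the table.
cell_counts = Counter(records)

# Row and column totals, and the total frequency.
row_totals = Counter(sex for sex, _ in records)
col_totals = Counter(response for _, response in records)
total = sum(cell_counts.values())

# Print the table with its margins.
responses = ["Cured", "Not cured"]
print(f"{'Sex':<8}" + "".join(f"{r:>11}" for r in responses) + f"{'Total':>8}")
for sex in ["Male", "Female"]:
    cells = "".join(f"{cell_counts[(sex, r)]:>11}" for r in responses)
    print(f"{sex:<8}{cells}{row_totals[sex]:>8}")
print(f"{'Total':<8}" + "".join(f"{col_totals[r]:>11}" for r in responses) + f"{total:>8}")
```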
The basic rule for displaying quantitative data is the same as that for displaying qualitative data but in quantitative data, categories have to be created by grouping values of variables. [Table - 2], which gives the distribution of babies by birth weight, is an example of such a display in the form of a table.
This table is called a "frequency distribution table" or just a distribution table. In the first column, the limits of each class are displayed. There are two limits, one lower and the other upper, which are accordingly labelled as lower class limit and upper class limit respectively. Here, the upper class limit of the previous class coincides with the lower class limit of the class interval under consideration. Such a method is called the exclusive method of presenting class intervals. Under the exclusive method, the upper limit of each class is not included in that class for counting frequency.
The other method of presenting class intervals is the method of inclusion. Under the method of inclusion, the class intervals for data given in [Table - 2] will become 0 - 0.9, 1 - 1.9, 2 - 2.9, ... up to 5 - 5.9.
The difference between consecutive lower or upper limits is called the width of the class and the midpoint of any class is called the class mark of the concerned class interval. For example, the class mark of the first class is the mid-point of the class interval 0 - 1 and is equal to the sum of the lower and upper class limits divided by two. In the present example it is equal to (0 + 1) / 2 = 0.5. Class mark is a representative value of the variable for the given class interval.
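A minimal sketch of the exclusive method is given below. The twelve birth weights are hypothetical and are not the data of [Table - 2]; only the class limits 0 - 1 to 5 - 6 follow the text. Note how a weight of exactly 3.0 kg is counted in the class 3 - 4 and not in 2 - 3.

```python
# Hypothetical birth weights in kg (illustrative values only).
weights = [2.4, 3.1, 2.9, 3.0, 1.8, 3.6, 2.0, 4.2, 3.3, 2.7, 0.9, 5.1]

# Class limits 0 - 1, 1 - 2, ..., 5 - 6 as (lower, upper) pairs.
edges = [0, 1, 2, 3, 4, 5, 6]
classes = list(zip(edges[:-1], edges[1:]))

# Exclusive method: a value equal to the upper limit of a class is
# not counted in that class but in the next one.
frequency = {c: 0 for c in classes}
for w in weights:
    for lower, upper in classes:
        if lower <= w < upper:
            frequency[(lower, upper)] += 1
            break

# Class width and class mark (mid-point) of each class.
width = {(lo, hi): hi - lo for lo, hi in classes}
class_mark = {(lo, hi): (lo + hi) / 2 for lo, hi in classes}

for lo, hi in classes:
    print(f"{lo} - {hi}  mark = {class_mark[(lo, hi)]}  f = {frequency[(lo, hi)]}")
```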
Graphs: "A picture is worth a thousand words" Chinese Proverb
The above-mentioned proverb explains the importance of a pictorial presentation of data. Pictures, graphs or diagrams aid the reader by saving time and highlighting the important points. The method of graphical presentation is to be decided on the basis of the type of data, as qualitative and quantitative data are presented differently.
There are four ways of presenting qualitative data: bar chart, pie diagram, proportional bar chart and pictogram.
(i) Bar chart: in a bar chart, categories of variables are shown on the horizontal i.e. X-axis (abscissa) and the frequency is plotted on the vertical i.e. Y-axis (ordinate). Bars are then constructed to show the frequency or relative frequency (ratio of frequency of class to total frequency) for each category. [Figure:1] gives a bar chart showing males and females included in a study on an antibiotic [Table - 1] and [Figure:2] gives the same chart by response categories.
(ii) Pie diagram: In a pie diagram a circle is drawn with the angle at the centre (360°) or the area of the circle representing total frequency. The circle is then divided into sectors with angles at the centre or areas of the sectors proportional to the observed frequency in each category of the concerned variable. [Figure:3] and [Figure:4] are pie diagrams of the same data [Table - 1] on sex and response categories.
(iii) Proportional bar chart: Proportional bar chart is a bar chart with subdivisions having their lengths or areas proportional to the frequencies of respective categories of the variable. [Figure:5] gives a proportional bar chart for data on sex distribution given in [Table - 1].
(iv) Pictogram: In this method a suitable picture is used to represent the frequency in each category of the variable. In respect of data on males and females in [Table - 1], a picture of a male and female can be used to represent males and females respectively with a suitable scale like 1 picture equal to 100 units. Thus a pictogram for data on sex distribution will appear as shown in [Figure:6].
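The arithmetic behind the pie diagram, the proportional bar chart and the pictogram can be sketched as follows. The split of 600 males and 400 females is hypothetical and chosen only so that the total is 1000; it is not taken from the tables. The scale of 1 picture equal to 100 units follows the text.

```python
# Hypothetical frequencies by sex (illustrative; total of 1000 patients).
frequencies = {"Male": 600, "Female": 400}
total = sum(frequencies.values())

# Pie diagram: angle at the centre = frequency * 360 / total frequency.
angles = {k: f * 360 / total for k, f in frequencies.items()}

# Proportional bar chart: length of each subdivision as a percentage
# of the full bar.
percentages = {k: f * 100 / total for k, f in frequencies.items()}

# Pictogram: number of pictures at a scale of 1 picture = 100 units.
UNITS_PER_PICTURE = 100
pictures = {k: f / UNITS_PER_PICTURE for k, f in frequencies.items()}

for k in frequencies:
    print(f"{k}: {angles[k]} degrees, {percentages[k]}%, {pictures[k]} pictures")
```

Whatever the frequencies, the sector angles add up to 360 and the percentages to 100, which serves as a quick check on the construction.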
Quantitative data are presented using the following graphs: frequency polygon, frequency curve, cumulative frequency graph (also called ogive) and histogram.
(i) Frequency polygon: In this presentation, class marks of class intervals are shown on the X-axis and the corresponding frequencies are plotted on the Y-axis. Consecutive points thus marked are joined by straight lines. [Figure:7] shows a frequency polygon for data presented in [Table - 2]. In [Figure:7], against class marks 0.5, 1.5, .... and 5.5, frequencies given in [Table - 2] are shown by points on the Y-axis and the points joined by straight lines form a frequency polygon.
(ii) Frequency curve: Frequency curve is constructed in the same way as a frequency polygon but the points are joined by a smooth curve instead of straight lines. [Figure:8] is a frequency curve for data in [Table - 2].
(iii) Cumulative frequency curves (Ogives): Cumulative frequency for the first class interval is the same as its actual frequency. Cumulative frequency (CF) of any given class interval is obtained by adding its actual frequency to the cumulative frequency up to the previous class interval. Cumulative frequencies for data given in [Table - 2] are given in [Table - 3].
Cumulative frequency of 40 shown against class interval 2-3 under the column less than type gives directly the number of babies weighing less than 3 kg, whereas cumulative frequency in the same row under greater than type, viz. 82, gives the number of babies weighing 2 kg or more.
These cumulative frequencies plotted against respective class limits (lower for cumulative frequency greater than type and upper for less than type) and then joined by smooth curves are given in [Figure:9] and [Figure:10] respectively.
Cumulative frequency curves are useful in determining locational measures like percentiles, deciles, quartiles and the median of the data.
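Both types of cumulative frequency, and the reading of the median from them, can be sketched in Python. The class frequencies below are hypothetical; they were chosen only so that the less than type CF for class 2 - 3 is 40 and the greater than type CF is 82, the two figures quoted above, and they should not be taken as the contents of [Table - 2]. The median is located by straight-line interpolation within the class, which is an assumed approximation to reading it off a smooth ogive.

```python
from itertools import accumulate

# Class intervals (exclusive method) and hypothetical frequencies.
classes = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
freqs = [2, 6, 32, 35, 12, 3]

# Less than type: running total from the first class, read against
# the upper class limit.
less_than = list(accumulate(freqs))

# Greater than type: running total from the last class, read against
# the lower class limit.
greater_than = list(accumulate(reversed(freqs)))[::-1]

# Median: the value below which half of the observations fall.
total = less_than[-1]
half = total / 2
i = next(k for k, cf in enumerate(less_than) if cf >= half)
cf_before = less_than[i - 1] if i > 0 else 0
lower, upper = classes[i]
median = lower + (half - cf_before) / freqs[i] * (upper - lower)

for (lo, hi), f, lt, gt in zip(classes, freqs, less_than, greater_than):
    print(f"{lo} - {hi}  f = {f:>2}  less than {hi}: {lt:>2}  greater than {lo}: {gt:>2}")
print(f"median is approximately {median:.2f} kg")
```

Quartiles, deciles and percentiles can be located in the same way by replacing one half of the total with the required fraction.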
(iv) Histogram: Histogram is a set of rectangles standing side by side with bases equal to width of the class intervals and areas proportional to the respective class frequencies. Thus, in practice, each rectangle is constructed with bases equal to the class intervals and height equal to the ratio of the class frequency to width of the concerned class (i.e. f/w).
The process of constructing a rectangle becomes very easy when the class widths are equal. The data on birth weight of babies presented in [Table - 2] can be represented as shown in [Figure:11].
Here, widths of the class intervals being the same, respective frequencies are shown as heights of different class intervals. It must be noted that technically there should be no gap between the successive rectangles.
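The rule height = f/w matters only when class widths differ. The sketch below uses hypothetical classes of unequal width, with made-up frequencies, to show that the heights computed in this way make the area of each rectangle equal to its class frequency.

```python
# Hypothetical classes of unequal width as (lower, upper, frequency).
classes = [(0, 2, 8), (2, 3, 32), (3, 4, 35), (4, 6, 15)]

# Height of each rectangle = class frequency / class width (f / w).
widths = [upper - lower for lower, upper, _ in classes]
heights = [f / w for (_, _, f), w in zip(classes, widths)]

# Area of each rectangle = base * height, which recovers the frequency.
areas = [w * h for w, h in zip(widths, heights)]

for (lower, upper, f), w, h in zip(classes, widths, heights):
    print(f"{lower} - {upper}  width = {w}  f = {f}  height = {h}")
```

With equal widths of one unit, f/w reduces to f, which is why the frequencies themselves can be used as heights in that case.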
In case the data are given in a frequency distribution table of the inclusive type, the gap between the upper limit of a class interval and the lower limit of the succeeding class interval has to be closed by extending each of the two limits by half the gap.
The new upper and lower limits are called class boundaries. The following example illustrates this concept of class boundaries.
It can be noticed in the above example that the upper limit 4 of the class interval 0 - 4 is extended to
4 + ½ (difference between lower limit (5) of next class and upper limit (4) of class concerned)
= 4 + ½ (5 - 4)
= 4.5.
In the same way, the lower limit 5 of class 5 - 6 gets changed to 4.5.
This adjustment is required to be done before constructing the histogram in order to avoid gaps between successive rectangles. Histograms are useful in graphically determining the most frequently repeated (mode) value in the given data.
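The adjustment can be sketched as a small Python function. The function name class_boundaries and the three inclusive classes used to try it out are hypothetical; the function also assumes that the gap between successive classes is the same throughout the table.

```python
def class_boundaries(limits):
    """Convert inclusive class limits into class boundaries.

    limits is a list of (lower, upper) pairs in increasing order.
    Assumes at least two classes and the same gap between every
    pair of successive classes.
    """
    # Gap between the upper limit of one class and the lower limit
    # of the next class.
    gap = limits[1][0] - limits[0][1]
    half_gap = gap / 2
    # Lower limits move down and upper limits move up by half the gap.
    return [(lower - half_gap, upper + half_gap) for lower, upper in limits]

# Hypothetical inclusive classes (illustrative only).
inclusive = [(0, 4), (5, 9), (10, 14)]
boundaries = class_boundaries(inclusive)

for (lo, hi), (b_lo, b_hi) in zip(inclusive, boundaries):
    print(f"{lo} - {hi}  becomes  {b_lo} - {b_hi}")
```

After the adjustment, the upper boundary of each class coincides with the lower boundary of the next, so that the rectangles of the histogram stand side by side without gaps.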
[Table - 1], [Table - 2], [Table - 3], [Table - 4]