PSPP – Statistical Software – Review

PSPP is an open source statistical analysis and data mining tool. It was designed as a free alternative to IBM’s SPSS tool.

PSPP is very similar to SPSS and includes most of its features. It can process up to 1 billion cases and variables, offers both a graphical and a terminal user interface, and supports data import from spreadsheets, text files and databases. Most notably, PSPP has no license fees or expiration period. It runs on Windows, Linux and Mac OS X.

This article reviews some of PSPP’s statistical tests, using the server logs recorded for this blog in October 2010. The sample data covers 1503 unique visitors to the blog. The variables recorded are total hits to the blog, unique page hits, kilobytes of data downloaded, country of origin, time spent on the blog and the search term used to find the blog.

Single Variable Analysis

The table below shows the output from PSPP’s ‘Frequencies’ procedure, which is used for analysing a single categorical variable. In this case we are comparing the countries from which users visited the blog. From the results we can see that the largest share of users (36.79%) came from the United States, with China and the UK in equal second place.

Value Label         Value  Frequency  Percent  Valid Percent  Cum Percent
United States           0        553    36.79          36.79        36.79
China                   1        200    13.31          13.31        50.10
Great Britain           2        200    13.31          13.31        63.41
Australia               3        100     6.65           6.65        70.06
Poland                  4        100     6.65           6.65        76.71
Czech Republic          5         50     3.33           3.33        80.04
Germany                 6         50     3.33           3.33        83.37
Brazil                  7         50     3.33           3.33        86.69
Canada                  8         50     3.33           3.33        90.02
India                   9         50     3.33           3.33        93.35
Russian Federation     10         50     3.33           3.33        96.67
Netherlands            11         50     3.33           3.33       100.00
Total                           1503    100.0          100.0

Table produced by PSPP.
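
To see where the Percent and Cum Percent columns come from, the short Python sketch below rebuilds them from the raw frequencies. It is only an illustration of the arithmetic: PSPP calculates these figures itself, and the function name frequency_table is our own, not part of PSPP.

```python
# Visitor counts per country, copied from the Frequencies output.
counts = {
    "United States": 553, "China": 200, "Great Britain": 200,
    "Australia": 100, "Poland": 100, "Czech Republic": 50,
    "Germany": 50, "Brazil": 50, "Canada": 50, "India": 50,
    "Russian Federation": 50, "Netherlands": 50,
}


def frequency_table(counts):
    """Return (label, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows, running = [], 0
    for label, freq in counts.items():
        running += freq
        rows.append((label, freq, 100 * freq / total, 100 * running / total))
    return rows


table = frequency_table(counts)
for label, freq, pct, cum in table:
    print(f"{label:<20}{freq:>6}{pct:>8.2f}{cum:>8.2f}")
```

Because no country values are missing in this sample, Percent and Valid Percent are identical, so the sketch only calculates one of them.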

Chart produced using the Google Visualisation API.

Next we look at the ‘Explore’ procedure, which is used for analysing metric (numerical) variables such as the total number of hits made by each visitor. The descriptives table below gives us some useful information: the mean number of hits per visitor was 5 (rounded from 5.05), the median was 5, the minimum was 1 (otherwise the visit couldn’t have been recorded) and the maximum was 15.

Total number of hits made to the blog

                                               Statistic  Std. Error
Mean                                                5.05         .05
95% Confidence Interval for Mean  Lower Bound       4.95
                                  Upper Bound       5.15
5% Trimmed Mean                                     4.97
Median                                              5.00
Variance                                            4.25
Std. Deviation                                      2.06
Minimum                                             1.00
Maximum                                            15.00
Range                                              14.00
Interquartile Range                                 2.00
Skewness                                             .65         .06
Kurtosis                                             .92         .13

Table produced by PSPP.
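
The standard error and the confidence interval reported for the mean can be checked by hand from the mean, standard deviation and sample size. The Python sketch below assumes a large-sample normal interval with a critical value of 1.96. That choice is our assumption; PSPP may use the t distribution instead, which gives almost the same answer with 1503 cases.

```python
import math

# Summary values read from the descriptives output.
n = 1503          # number of visitors
mean = 5.05       # mean hits per visitor
std_dev = 2.06    # standard deviation of hits

# Standard error of the mean: s / sqrt(n).
std_error = std_dev / math.sqrt(n)

# Large-sample 95% interval using the normal critical value 1.96
# (an assumption; the t critical value is nearly identical at this n).
z = 1.96
lower = mean - z * std_error
upper = mean + z * std_error

print(f"SE = {std_error:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Rounded to two decimal places, the result lines up with the 4.95 and 5.15 bounds PSPP reports.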

The ‘Explore’ procedure can also produce a percentile analysis. We can see from the table below that the 25th percentile is 4 hits, the median (50th percentile) is 5 and the 75th percentile is 6. In other words, roughly a quarter of visitors made 4 hits or fewer and roughly a quarter made 6 or more.

Total number of hits made to the blog

              Percentiles
                   5     10     25     50     75     90     95     25     50     75
HAverage        2.00   3.00   4.00   5.00   6.00   8.00   9.00   4.00   5.00   6.00
Tukey’s Hinges                4.00   5.00   6.00                 4.00   5.00   6.00

Table produced by PSPP.

Complementing the ‘Explore’ procedure, PSPP can produce a histogram. Histograms help us to visualise the distribution of a metric variable. Our histogram shows that the distribution is approximately symmetric, which means we can use the mean when reporting the average number of hits to the site. If the histogram were not symmetric, the median would be a better measure of the average.

Chart produced by PSPP.

SPSS has one distinct advantage over PSPP when using the ‘Explore’ procedure: it can produce box plots. Box plots are another great way to visualise the distribution of a metric variable. The box plot below was produced using the Google Visualisation API with the data produced by PSPP. The top and bottom markers represent the maximum and minimum number of hits made by visitors. The box spans the 25th to 75th percentiles (the first and third quartiles), which cover the middle half of visitors, and the line through the middle of the box represents the median.

Chart produced using the Google Visualisation API.
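
The numbers behind a box plot are simple to derive from the PSPP output. The Python sketch below takes the minimum, quartiles and maximum reported earlier and works out the interquartile range. It also computes Tukey's conventional 1.5 × IQR fences, which many charting tools use to flag outliers. The fences are our addition for illustration; the chart above simply draws its markers at the minimum and maximum.

```python
# Five-number summary read from the descriptives and percentiles output.
minimum, q1, median, q3, maximum = 1.0, 4.0, 5.0, 6.0, 15.0

iqr = q3 - q1                      # interquartile range (height of the box)
lower_fence = q1 - 1.5 * iqr       # Tukey's conventional lower fence
upper_fence = q3 + 1.5 * iqr       # Tukey's conventional upper fence

# Values outside the fences are usually drawn as individual outlier points.
max_is_outlier = maximum > upper_fence
min_is_outlier = minimum < lower_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"maximum flagged as outlier: {max_is_outlier}")
```

Under that convention the visitor with 15 hits would be plotted as an outlier, while the minimum of 1 sits exactly on the lower fence and would not.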

Hypothesis Testing

As well as single variable analysis, PSPP gives us the opportunity to test hypotheses. For example, we might hypothesise that visitors from the United States spent more time on the blog than visitors from China because the blog is written in English. To test this we can perform an independent samples t-test. The tables below show that US visitors spent a mean of 1.36 minutes on the site, more than the 0.83 minutes spent by Chinese visitors. However, the output also shows that the difference is not statistically significant: the two-tailed significance value is higher than 0.05 whether equal variances are assumed (0.27) or not (0.15). We can also see that the 95% confidence interval of the difference runs from -0.19 to 1.25. This means the true difference in the whole population could plausibly lie anywhere between US visitors spending 0.19 minutes less and 1.25 minutes more than Chinese visitors. Because this interval includes zero, we unfortunately cannot conclude that the two groups differ.

          COUNTRY          N  Mean  Std. Deviation  S.E. Mean
DURATION  United States  553  1.36            6.44        .27
          China          200   .83            3.49        .25

Levene’s Test for Equality of Variances

                                      F  Sig.
DURATION  Equal variances assumed  9.40   .00

t-test for Equality of Means

                                                                                 95% Confidence Interval of the Difference
                                          t      df  Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower  Upper
DURATION  Equal variances assumed      1.10  751.00              .27              .53                    .37   -.19   1.25
          Equal variances not assumed  1.43  640.76              .15              .53                    .37   -.20   1.25

Table produced by PSPP.
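
The ‘Equal variances not assumed’ row can be approximately reproduced from the group statistics alone. The Python sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The p-value uses a normal approximation to the t distribution, which is our simplification to avoid extra libraries and is only reasonable because the degrees of freedom are in the hundreds. The function name welch_t is our own.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic, degrees of freedom and SE from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2      # squared standard errors
    se_diff = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / se_diff
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se_diff


# Group statistics from the PSPP output (US first, then China).
t, df, se_diff = welch_t(1.36, 6.44, 553, 0.83, 3.49, 200)

# Two-tailed p-value via the normal approximation (an assumption, see above).
p_approx = math.erfc(abs(t) / math.sqrt(2))

print(f"t = {t:.2f}, df = {df:.1f}, SE = {se_diff:.2f}, p ~ {p_approx:.2f}")
```

Because the inputs are already rounded to two decimal places, the results differ slightly from PSPP's in the last digit, but they lead to the same conclusion.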

Looking for Relationships

As well as t-tests, PSPP can perform regression analysis, which is useful for identifying relationships between two metric (numerical) variables. Suppose we suggest that as a visitor views more pages, more data is downloaded from the server. The correlations table below shows that the Pearson correlation between these variables is 1.00, which indicates a perfect positive linear relationship. The coefficients table shows a slope (B) of 22.28: on average, every additional page visited added 22.28 KB to the data downloaded from the server. The significance of the slope is reported as .00; as this is less than 0.05, we can conclude that the relationship is statistically significant. As expected, as page hits increase, more data is downloaded from the server. This is perhaps a little obvious, but a test like this can help validate the data.

                                                               Total number of hits made to the blog  Bandwidth downloaded by user (kilobytes)
Total number of hits made to the blog     Pearson Correlation  1.00                                   1.00
                                          Sig. (2-tailed)                                             .00
                                          N                    1503                                   1503
Bandwidth downloaded by user (kilobytes)  Pearson Correlation  1.00                                   1.00
                                          Sig. (2-tailed)      .00
                                          N                    1503                                   1503

   R  R Square  Adjusted R Square  Std. Error of the Estimate
1.00      1.00               1.00                         .00

            Sum of Squares    df  Mean Square         F  Significance
Regression         3166653     1      3166653  9.7E+015           .00
Residual               .00  1501          .00
Total              3166653  1502

                                           B  Std. Error  Beta         t  Significance
(Constant)                               .00         .00   .00       .00          1.00
Total number of hits made to the blog  22.28         .00  1.00  98362442           .00

Table produced by PSPP.
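
A correlation of exactly 1.00 with a zero intercept and zero residual is what we would see if every hit transferred the same amount of data. The Python sketch below illustrates this with a handful of made-up visitors, each downloading exactly 22.28 KB per hit. The sample values and the function name least_squares are hypothetical; they are not taken from the server logs.

```python
def least_squares(xs, ys):
    """Return (intercept, slope, pearson_r) for a simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r


# Hypothetical visitors: every hit transfers exactly 22.28 KB.
hits = [1, 2, 4, 5, 5, 6, 9, 15]
kilobytes = [22.28 * h for h in hits]

intercept, slope, r = least_squares(hits, kilobytes)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, r = {r:.2f}")
```

Real traffic would normally show some scatter around the line, so a perfect fit like the one PSPP reports is itself a hint about how the bandwidth figures in the sample were generated.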

Finally, PSPP allows us to perform a crosstabs analysis, which helps us to identify relationships between two categorical variables. In this case we have produced a crosstabs analysis comparing the search term used to find the blog with the country of origin. One of the search terms relates to a blog post describing how to use Google’s language API. We might assume that visitors from non-English-speaking countries would be more likely to search for this than visitors from English-speaking countries, as most of the web’s content is in English. The crosstabs table below suggests this is the case. However, a chi-square test indicates that the association is not statistically significant: the significance value of the Pearson chi-square (1.00) is well above 0.05. Unfortunately we do not have enough information to draw any conclusions.

Summary

                                 Cases
                                 Valid          Missing        Total
                                 N    Percent   N    Percent   N     Percent
Search Term * Country Of Origin  723  48.1%     780  51.9%     1503  100.0%

                                      COUNTRY
SEARCH                                United States   China  Great Britain  Australia  Poland  Czech Republic  Germany  Brazil  Canada   India  Russian Federation  Netherlands   Total
as3 iterate through display objects            33.0    12.0           11.0        4.0     4.0             2.0      2.0     2.0     2.0     2.0                 2.0          2.0    78.0
                                              12.4%   12.5%          11.5%       8.3%    8.3%            8.3%     8.3%    8.3%    8.3%    8.3%                8.3%         8.3%   10.8%
curl web crawler in php                        33.0    12.0           10.0        6.0     6.0             3.0      3.0     3.0     3.0     3.0                 3.0          3.0    88.0
                                              12.4%   12.5%          10.4%      12.5%   12.5%           12.5%    12.5%   12.5%   12.5%   12.5%               12.5%        12.5%   12.2%
google language api php example                33.0    12.0           15.0        8.0     8.0             4.0      4.0     4.0     4.0     4.0                 4.0          4.0   104.0
                                              12.4%   12.5%          15.6%      16.7%   16.7%           16.7%    16.7%   16.7%   16.7%   16.7%               16.7%        16.7%   14.4%
google web toolkit animation effects           49.0    16.0           16.0        8.0     8.0             4.0      4.0     4.0     4.0     4.0                 4.0          4.0   125.0
                                              18.4%   16.7%          16.7%      16.7%   16.7%           16.7%    16.7%   16.7%   16.7%   16.7%               16.7%        16.7%   17.3%
iterating dom childnodes                       31.0    12.0           12.0        6.0     6.0             3.0      3.0     3.0     3.0     3.0                 3.0          3.0    88.0
                                              11.6%   12.5%          12.5%      12.5%   12.5%           12.5%    12.5%   12.5%   12.5%   12.5%               12.5%        12.5%   12.2%
kinematics in flash animation                  19.0     8.0            8.0        3.0     2.0             1.0      1.0     1.0     1.0     1.0                 1.0          1.0    47.0
                                               7.1%    8.3%           8.3%       6.3%    4.2%            4.2%     4.2%    4.2%    4.2%    4.2%                4.2%         4.2%    6.5%
papervision 3d rotate cube                     33.0    12.0           12.0        7.0     8.0             4.0      4.0     4.0     4.0     4.0                 4.0          3.0    99.0
                                              12.4%   12.5%          12.5%      14.6%   16.7%           16.7%    16.7%   16.7%   16.7%   16.7%               16.7%        12.5%   13.7%
zend amf example                               36.0    12.0           12.0        6.0     6.0             3.0      3.0     3.0     3.0     3.0                 3.0          4.0    94.0
                                              13.5%   12.5%          12.5%      12.5%   12.5%           12.5%    12.5%   12.5%   12.5%   12.5%               12.5%        16.7%   13.0%
Total                                         267.0    96.0           96.0       48.0    48.0            24.0     24.0    24.0    24.0    24.0                24.0         24.0   723.0
                                             100.0%  100.0%         100.0%     100.0%  100.0%          100.0%   100.0%  100.0%  100.0%  100.0%              100.0%       100.0%  100.0%

Statistic                     Value  df  Asymp. Sig. (2-tailed)
Pearson Chi-Square            10.29  77                    1.00
Likelihood Ratio              10.49  77                    1.00
Linear-by-Linear Association    .35   1                     .55
N of Valid Cases                723

Table produced by PSPP.
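
The Pearson chi-square statistic can be recomputed directly from the observed counts in the crosstabs output. The Python sketch below does this with the standard library only. The p-value uses the Wilson-Hilferty normal approximation to the chi-square distribution, which is our own shortcut rather than the method PSPP uses, and the function names are hypothetical.

```python
import math

# Observed counts from the crosstabs output: one row per search term,
# one column per country (United States through Netherlands).
observed = [
    [33, 12, 11, 4, 4, 2, 2, 2, 2, 2, 2, 2],   # as3 iterate through display objects
    [33, 12, 10, 6, 6, 3, 3, 3, 3, 3, 3, 3],   # curl web crawler in php
    [33, 12, 15, 8, 8, 4, 4, 4, 4, 4, 4, 4],   # google language api php example
    [49, 16, 16, 8, 8, 4, 4, 4, 4, 4, 4, 4],   # google web toolkit animation effects
    [31, 12, 12, 6, 6, 3, 3, 3, 3, 3, 3, 3],   # iterating dom childnodes
    [19, 8, 8, 3, 2, 1, 1, 1, 1, 1, 1, 1],     # kinematics in flash animation
    [33, 12, 12, 7, 8, 4, 4, 4, 4, 4, 4, 3],   # papervision 3d rotate cube
    [36, 12, 12, 6, 6, 3, 3, 3, 3, 3, 3, 4],   # zend amf example
]


def pearson_chi_square(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


def approx_p_value(chi2, df):
    """Upper-tail p-value via the Wilson-Hilferty normal approximation."""
    z = ((chi2 / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))


chi2, df = pearson_chi_square(observed)
print(f"chi-square = {chi2:.2f}, df = {df}, p ~ {approx_p_value(chi2, df):.2f}")
```

A statistic this far below its degrees of freedom means the observed counts sit very close to what we would expect if search term and country were unrelated.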

