remove outliers Remove outliers from document file columns
file format: SPIDER document file

PURPOSE: Remove outliers from columns in a document file. The program was written to sort through the coordinates created by correspondence analysis or PCA. Since real outlier images have a major influence on the direction of the eigenvectors, it is important to remove them. Be careful if you are removing more than 10-20% of your images: you may be removing a complete cluster. It is best to confirm the removal by checking the factor map for the location of these particles. Since the output file records which coordinate caused each removal, it is easy to look at the corresponding map to confirm this visually.

USAGE: remove outliers

.Input coordinate docfile: imccoord001
[Enter the name of the document file that you want to process.]

.Output selection docfile: removelines001
[Enter the name of the output document file. It will contain the keys of the lines to be removed, each followed by a 0, followed by the number of the input column that caused the removal. A key may occur multiple times if the exclusion is based on multiple columns. This file can be appended to a typical 0/1 selection document file, or, if such an existing selection file is entered as the output name, the new lines will be appended to it.]

.Output doc file format (0=new,default,1=old): 1
[Enter the format of the output doc file. The default is 0 (= new format). Option 1 was added for compatibility with SPIDER version 5.0.]

.Columns to include: 2-4,6
[Enter which columns of the input document file should be checked for outliers.]

.Sigma multiplier for threshold: 3.3
[Enter the factor by which the standard deviation of a column is multiplied to determine outliers. Outliers are those values that are smaller than average-factor*sigma or larger than average+factor*sigma.]

.Number of columns to write to the output docfile: 3
[The minimum number is 1, which will write only the 0 to the output file. 2 will also write the column that caused the outlier. Any additional columns will be filled with 0s. The reason for writing multiple columns is that the selection file used may have extra information in each line, and adding a shorter line could create problems when reading the file later.]

Programs: em_removeoutliers.py, doceliminate.f

Author(s): M. Radermacher
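
The threshold rule and the output layout described under USAGE can be illustrated with a short Python sketch. This is not the code of em_removeoutliers.py or doceliminate.f. The function names (parse_columns, find_outliers, selection_lines) are hypothetical, and several details are assumptions: ranges such as 2-4 are taken as inclusive, columns are counted from 1 over the values that follow the key, and the population standard deviation is used because the page does not say which estimator the program applies.

```python
from statistics import mean, pstdev


def parse_columns(spec):
    """Expand a column list such as '2-4,6' (assumption: ranges are inclusive)."""
    cols = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cols.extend(range(int(first), int(last) + 1))
        else:
            cols.append(int(part))
    return cols


def find_outliers(rows, columns, factor):
    """Return (key, column) pairs for values outside average +/- factor*sigma.

    rows maps each key to its list of column values.
    Assumption: column numbers start at 1 and exclude the key itself.
    A key is reported once for every column in which it is an outlier.
    """
    flagged = []
    for col in columns:
        values = [vals[col - 1] for vals in rows.values()]
        avg = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        low = avg - factor * sigma
        high = avg + factor * sigma
        for key, vals in rows.items():
            if vals[col - 1] < low or vals[col - 1] > high:
                flagged.append((key, col))
    return flagged


def selection_lines(flagged, ncols):
    """Build output lines: key, then 0, then the offending column, then 0 padding."""
    lines = []
    for key, col in flagged:
        fields = [0, col][:ncols]              # ncols=1 keeps only the 0
        fields += [0] * (ncols - len(fields))  # pad any extra columns with 0s
        lines.append((key, fields))
    return lines
```

With these definitions, parse_columns("2-4,6") expands to columns 2, 3, 4 and 6, and a key flagged in column 2 with three output columns yields the fields 0, 2, 0, matching the layout described for the output selection docfile.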