Extracting features

Before we can process our samples, we need to create a list of which samples we want to process. We can get a list (or, more acurately a vector) of files using Sys.glob, which allows wild-card matching of files. As we want to bind some more data to these filenames, we convert the vector into a data frame. We’ve been told the following about our samples - Samples differ on three levels - Soil type (3 types) - Biological replicate (3 per soil type) - Technical replicate (3 per biological replicate) - The samples were run in order of soil type, biological rep, technical rep. E.g. the first sample is Soil A, biorep 1, techrep 1, the next is Soil A, biorep 1, techrep 1, and so on, with the last being Soil O, biorep 3, techrep 3. - The order of soil is A (Agricultural), F (Forest), O (Organic)

We can repeat an element of a vector using the rep() function. E.g:

rep('A', 3)

## [1] "A" "A" "A"

gives us A A A A vector itself can also be repeated e.g.:

rep(c('A','B'), 3)

## [1] "A" "B" "A" "B" "A" "B"

gives us A B A B A B. This can also be done by element:

rep(c('A','B'), each=3)

## [1] "A" "A" "A" "B" "B" "B"

which gets us A A A B B B

Given the structure below, how can we fill in the missing annotations (i.e. replace the 0s)?

samples<-as.data.frame(Sys.glob("SUM/*.mzML"), stringsAsFactors = F)
colnames(samples)[1]<-"filename"
samples$soil<-0
samples$biorep<-0
samples$techrep<-0
samples$bio_id<-paste0(samples$soil,samples$biorep)
samples$tech_id<-paste0(samples$bio_id,".",samples$techrep)
samples

##                      filename soil biorep techrep bio_id tech_id
## 1  SUM/HW-290120-007.sum.mzML    0      0       0     00    00.0
## 2  SUM/HW-290120-008.sum.mzML    0      0       0     00    00.0
## 3  SUM/HW-290120-009.sum.mzML    0      0       0     00    00.0
## 4  SUM/HW-290120-010.sum.mzML    0      0       0     00    00.0
## 5  SUM/HW-290120-011.sum.mzML    0      0       0     00    00.0
## 6  SUM/HW-290120-012.sum.mzML    0      0       0     00    00.0
## 7  SUM/HW-290120-013.sum.mzML    0      0       0     00    00.0
## 8  SUM/HW-290120-014.sum.mzML    0      0       0     00    00.0
## 9  SUM/HW-290120-015.sum.mzML    0      0       0     00    00.0
## 10 SUM/HW-290120-016.sum.mzML    0      0       0     00    00.0
## 11 SUM/HW-290120-017.sum.mzML    0      0       0     00    00.0
## 12 SUM/HW-290120-018.sum.mzML    0      0       0     00    00.0
## 13 SUM/HW-290120-019.sum.mzML    0      0       0     00    00.0
## 14 SUM/HW-290120-020.sum.mzML    0      0       0     00    00.0
## 15 SUM/HW-290120-021.sum.mzML    0      0       0     00    00.0
## 16 SUM/HW-290120-022.sum.mzML    0      0       0     00    00.0
## 17 SUM/HW-290120-023.sum.mzML    0      0       0     00    00.0
## 18 SUM/HW-290120-024.sum.mzML    0      0       0     00    00.0
## 19 SUM/HW-290120-025.sum.mzML    0      0       0     00    00.0
## 20 SUM/HW-290120-026.sum.mzML    0      0       0     00    00.0
## 21 SUM/HW-290120-027.sum.mzML    0      0       0     00    00.0
## 22 SUM/HW-290120-028.sum.mzML    0      0       0     00    00.0
## 23 SUM/HW-290120-029.sum.mzML    0      0       0     00    00.0
## 24 SUM/HW-290120-030.sum.mzML    0      0       0     00    00.0
## 25 SUM/HW-290120-031.sum.mzML    0      0       0     00    00.0
## 26 SUM/HW-290120-032.sum.mzML    0      0       0     00    00.0
## 27 SUM/HW-290120-033.sum.mzML    0      0       0     00    00.0

Expected output

	filename	sum_filename	soil	biorep	techrep	bio_id	tech_id
1	RAW/HW-290120-007.mzML	SUM/HW-290120-007.sum.mzML	A	1	1	A1	A1.1
2	RAW/HW-290120-008.mzML	SUM/HW-290120-008.sum.mzML	A	1	2	A1	A1.2
3	RAW/HW-290120-009.mzML	SUM/HW-290120-009.sum.mzML	A	1	3	A1	A1.3
4	RAW/HW-290120-010.mzML	SUM/HW-290120-010.sum.mzML	A	2	1	A2	A2.1
5	RAW/HW-290120-011.mzML	SUM/HW-290120-011.sum.mzML	A	2	2	A2	A2.2
6	RAW/HW-290120-012.mzML	SUM/HW-290120-012.sum.mzML	A	2	3	A2	A2.3
7	RAW/HW-290120-013.mzML	SUM/HW-290120-013.sum.mzML	A	3	1	A3	A3.1
8	RAW/HW-290120-014.mzML	SUM/HW-290120-014.sum.mzML	A	3	2	A3	A3.2
9	RAW/HW-290120-015.mzML	SUM/HW-290120-015.sum.mzML	A	3	3	A3	A3.3
10	RAW/HW-290120-016.mzML	SUM/HW-290120-016.sum.mzML	F	1	1	F1	F1.1
11	RAW/HW-290120-017.mzML	SUM/HW-290120-017.sum.mzML	F	1	2	F1	F1.2
12	RAW/HW-290120-018.mzML	SUM/HW-290120-018.sum.mzML	F	1	3	F1	F1.3
13	RAW/HW-290120-019.mzML	SUM/HW-290120-019.sum.mzML	F	2	1	F2	F2.1
14	RAW/HW-290120-020.mzML	SUM/HW-290120-020.sum.mzML	F	2	2	F2	F2.2
15	RAW/HW-290120-021.mzML	SUM/HW-290120-021.sum.mzML	F	2	3	F2	F2.3
16	RAW/HW-290120-022.mzML	SUM/HW-290120-022.sum.mzML	F	3	1	F3	F3.1
17	RAW/HW-290120-023.mzML	SUM/HW-290120-023.sum.mzML	F	3	2	F3	F3.2
18	RAW/HW-290120-024.mzML	SUM/HW-290120-024.sum.mzML	F	3	3	F3	F3.3
19	RAW/HW-290120-025.mzML	SUM/HW-290120-025.sum.mzML	O	1	1	O1	O1.1
20	RAW/HW-290120-026.mzML	SUM/HW-290120-026.sum.mzML	O	1	2	O1	O1.2
21	RAW/HW-290120-027.mzML	SUM/HW-290120-027.sum.mzML	O	1	3	O1	O1.3
22	RAW/HW-290120-028.mzML	SUM/HW-290120-028.sum.mzML	O	2	1	O2	O2.1
23	RAW/HW-290120-029.mzML	SUM/HW-290120-029.sum.mzML	O	2	2	O2	O2.2
24	RAW/HW-290120-030.mzML	SUM/HW-290120-030.sum.mzML	O	2	3	O2	O2.3
25	RAW/HW-290120-031.mzML	SUM/HW-290120-031.sum.mzML	O	3	1	O3	O3.1
26	RAW/HW-290120-032.mzML	SUM/HW-290120-032.sum.mzML	O	3	2	O3	O3.2
27	RAW/HW-290120-033.mzML	SUM/HW-290120-033.sum.mzML	O	3	3	O3	O3.3

Solution

samples$soil<-c(rep('A',9),rep('F',9),rep('O',9))
samples$biorep<-c(rep(rep(1:3, each=3),3))
samples$techrep<-c(rep(1:3,9))
samples$bio_id<-paste0(samples$soil,samples$biorep)
samples$tech_id<-paste0(samples$bio_id,".",samples$techrep)
samples

##                      filename soil biorep techrep bio_id tech_id
## 1  SUM/HW-290120-007.sum.mzML    A      1       1     A1    A1.1
## 2  SUM/HW-290120-008.sum.mzML    A      1       2     A1    A1.2
## 3  SUM/HW-290120-009.sum.mzML    A      1       3     A1    A1.3
## 4  SUM/HW-290120-010.sum.mzML    A      2       1     A2    A2.1
## 5  SUM/HW-290120-011.sum.mzML    A      2       2     A2    A2.2
## 6  SUM/HW-290120-012.sum.mzML    A      2       3     A2    A2.3
## 7  SUM/HW-290120-013.sum.mzML    A      3       1     A3    A3.1
## 8  SUM/HW-290120-014.sum.mzML    A      3       2     A3    A3.2
## 9  SUM/HW-290120-015.sum.mzML    A      3       3     A3    A3.3
## 10 SUM/HW-290120-016.sum.mzML    F      1       1     F1    F1.1
## 11 SUM/HW-290120-017.sum.mzML    F      1       2     F1    F1.2
## 12 SUM/HW-290120-018.sum.mzML    F      1       3     F1    F1.3
## 13 SUM/HW-290120-019.sum.mzML    F      2       1     F2    F2.1
## 14 SUM/HW-290120-020.sum.mzML    F      2       2     F2    F2.2
## 15 SUM/HW-290120-021.sum.mzML    F      2       3     F2    F2.3
## 16 SUM/HW-290120-022.sum.mzML    F      3       1     F3    F3.1
## 17 SUM/HW-290120-023.sum.mzML    F      3       2     F3    F3.2
## 18 SUM/HW-290120-024.sum.mzML    F      3       3     F3    F3.3
## 19 SUM/HW-290120-025.sum.mzML    O      1       1     O1    O1.1
## 20 SUM/HW-290120-026.sum.mzML    O      1       2     O1    O1.2
## 21 SUM/HW-290120-027.sum.mzML    O      1       3     O1    O1.3
## 22 SUM/HW-290120-028.sum.mzML    O      2       1     O2    O2.1
## 23 SUM/HW-290120-029.sum.mzML    O      2       2     O2    O2.2
## 24 SUM/HW-290120-030.sum.mzML    O      2       3     O2    O2.3
## 25 SUM/HW-290120-031.sum.mzML    O      3       1     O3    O3.1
## 26 SUM/HW-290120-032.sum.mzML    O      3       2     O3    O3.2
## 27 SUM/HW-290120-033.sum.mzML    O      3       3     O3    O3.3

suppressWarnings(rawdata<-importMzMl(samples$filename, centroided = F, verbose=F))
length(rawdata)

## [1] 27

mzroi<-c(200,210)
inroi<-c(0,10000)
plot(rawdata[[1]])

plot(rawdata[[1]], xlim=mzroi, ylim=inroi)

spectra<-transformIntensity(rawdata, method="sqrt")
plot(spectra[[1]], xlim=mzroi, ylim=sqrt(inroi))

How much to smooth? Rule of thumb, as many data points in the smooth as there are in the top half of the peak.

plot(spectra[[1]], xlim=c(205.5,206.5), ylim=c(0,40), type='b')
abline(h=20, col='red')

Assuming the signal between 206.0 and 206.2 is a single peak, we count roughly 11 data points between when it first goes above 20 to when it last drops below 20 counts. For the Savitzky golay filter, we specify the number of points to include before and after the data point being calculated (the halfwindow size). The total window is 2 * halfwindow + 1 (the data point itself). If we have a full window of 11, the halfwindow size is therefore 5.

testspectrumhw5<-smoothIntensity(spectra[[1]], method="SavitzkyGolay", halfWindowSize=5)
plot(testspectrumhw5, xlim=c(205.5,206.5), ylim=c(0,40), type='b')

This still preserves part of the ‘twin’ peak shape. If we increase the smoothing further, we start to lose this information:

testspectrumhw11<-smoothIntensity(spectra[[1]], method="SavitzkyGolay", halfWindowSize=11)

## Warning in .replaceNegativeIntensityValues(object): Negative intensity values
## are replaced by zeros.

plot(spectra[[1]], xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="No smoothing")

plot(testspectrumhw5, xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="SG halfWindowSize 5")

plot(testspectrumhw11, xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="SG halfWindowSize 11")

The halfwindow of 5 looks like it will smooth the data, without losing all the detail. Here’s a slightly more zoomed out view and the same view for the non-smoothed data.

plot(testspectrumhw5, xlim=mzroi, ylim=sqrt(inroi), type='l')

plot(spectra[[1]], xlim=mzroi, ylim=sqrt(inroi), type='l')

We stick with a halfWindowSize of 5, and apply it to all spectra.

spectra<-smoothIntensity(spectra, method="SavitzkyGolay", halfWindowSize=5)

Next, what do baselines look like?

baselinetest <- estimateBaseline(spectra[[1]], method="SNIP", iterations=100)
plot(spectra[[1]])
lines(baselinetest, col='red')

It looks a little bumpy. In close up:

plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))
lines(baselinetest, col='red')

Increasing the interationcount makes it less sensitive:

baselinetest <- estimateBaseline(spectra[[1]], method="SNIP", iterations=250)
plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))
lines(baselinetest, col='red')

plot(spectra[[1]])
lines(baselinetest, col='red')

As it now looks smoother, we apply these settings for baseline removal. As an output to check what’s happened, we include a couple of plots.

spectra <- removeBaseline(spectra, method="SNIP", iterations=250)
plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))

plot(spectra[[1]])

In addition, we also want normalise for overall intensity, or total ion count. We can get the sums of the intensities to check for big differences

intensities<-sapply(spectra, function(spectrum){sum(intensity(spectrum))})
barplot(intensities/1e6, names.arg = samples$tech_id, xlab = "sample", ylab = "total intensity (in 1e6 counts)", las=2)

For the actual calibration, and comparison afterwards:

spectra <- calibrateIntensity(spectra, method="TIC")
intensities<-sapply(spectra, function(spectrum){sum(intensity(spectrum))})
barplot(intensities, names.arg = samples$tech_id, xlab = "sample", ylab = "total intensity", las=2)

Next we call some tentative peaks and see how well our samples are aligned

testpeaks<-detectPeaks(spectra, halfWindowSize=5, SNR=3)
warps<-determineWarpingFunctions(testpeaks, plot = T, tolerance = 100e-6)

Check output pdf. Seems to warp slightly at the edges, but overall no big effect. Warp:

spectra_aligned <- alignSpectra(spectra,
                        halfWindowSize=5,
                        SNR=3,
                        tolerance=100e-6,
                        warpingMethod="lowess")

What is the effect?

testpeaks_after<-detectPeaks(spectra_aligned, halfWindowSize=5, SNR=3)
plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
lines(spectra[[18]], col='grey', lty=3)
lines(spectra_aligned[[7]], col='blue')
lines(spectra[[7]], col='cyan3', lty=3)
points(testpeaks[[18]], col='grey', pch=4)
points(testpeaks[[7]], col='cyan3', pch=4)
points(testpeaks_after[[18]], pch=4)
points(testpeaks_after[[7]], col='blue', pch=4)

Looks like it might overcompensate a little, but does bring them closer together. Check if true:

peakmzs_18<-mz(testpeaks[[18]])
peakmzs_07<-mz(testpeaks[[7]])
peakmzs_after_18<-mz(testpeaks_after[[18]])
peakmzs_after_07<-mz(testpeaks_after[[7]])
peakmzs_18[peakmzs_18 > 1062 & peakmzs_18 < 1063] - 
  peakmzs_07[peakmzs_07 > 1062 & peakmzs_07 < 1063]

## [1] -0.01416016

peakmzs_after_18[peakmzs_after_18 > 1062 & peakmzs_after_18 < 1063] - 
  peakmzs_after_07[peakmzs_after_07 > 1062 & peakmzs_after_07 < 1063]

## [1] 0.003835096

In this case we can also consider to not align, as the random variation between the peaks in each sample is larger than the very mild correction we apply by warping the spectra.

We also still have some noise in the data - this will always be the case. In most applications we want to avoid looking at the noise too much, so often a Signal to noise ratio (SNR) is given as a criteria for peak calling. This means that to be be regarded as a peak, the peak height needs to be above a number of times the noise level determined by the SNR. We can investigate our data for noise like so:

noise<-MALDIquant::estimateNoise(spectra_aligned[[1]])
plot(spectra_aligned[[1]], xlim = mzroi, ylim=c(0,0.006))
lines(noise, col="red")
lines(noise[,1],noise[,2]*5, col="purple")
lines(noise[,1],noise[,2]*10, col="blue")

Now it looks like an SNR of 5 (purple line) would already cut out some peaks. However, this picture may be misleading. Let’s look at a much higher m/z range:

plot(spectra_aligned[[1]], xlim=c(1060,1070), ylim=c(0,0.01))
lines(noise, col="red") #Red line: noise level
lines(noise[,1],noise[,2]*5, col="purple") #Purple line: noise level * 5
lines(noise[,1],noise[,2]*10, col="blue") #Blue line: noise level * 10

Here it look like we get a lot of repeating signal just above the 5 SNR line. An SNR of 8, or possibly even 10 might prove a better balance between restricting noise and throwing away signal.

We can then call the peaks with some additional refinements:

peaks <- detectPeaks(spectra_aligned, method="MAD", halfWindowSize=5, SNR=8, refineMz = "descendPeak", signalPercentage=25)

As we saw previously, even after alignment there are small difference in the peak m/z values. For an illustration:

plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
for(n in seq_along(peaks)){
  points(peaks[[n]], pch = 4, col=rainbow(length(peaks))[[n]])
}

To get rid of those, we bin the Peaks from different samples within a certain tolerance together.

peaks<-binPeaks(peaks, method="strict", tolerance = 50e-6)
plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
for(n in seq_along(peaks)){
  points(peaks[[n]], pch = 4, col=rainbow(length(peaks))[[n]])
}

After this, we can gather our peaks and intensities:

featureMatrix <- intensityMatrix(peaks, spectra_aligned)

As you’ll have noticed, intensityMatrix is also reading in our aligned spectra. Why would it do that?

The featureMatrix object is, as the name suggests, a Matrix. One row for each sample, one column for each peak. How many peaks are there in the matrix?

ncol(featureMatrix)

## [1] 2041

peakspersample<-sapply(peaks, function(x){length(mz(x))})
barplot(peakspersample, names.arg = samples$tech_id, xlab = "sample", ylab = "peaks called", las=2)

Some peaks still appear to be random noise. Going back a step, we can filter peaks based on in which samples they are called. Based on the documentation for filterPeaks, what does the following do?

peaks_byrep<-filterPeaks(peaks, minFrequency = 1, labels = samples$bio_id, mergeWhitelists = TRUE)
peaks_byrep_bysample<-filterPeaks(peaks_byrep, minFrequency = 0.5, labels = samples$soil, mergeWhitelists = TRUE)

We can then get a Matrix with just the reliable features:

featureMatrix_filter <- intensityMatrix(peaks_byrep_bysample, spectra_aligned)

And check how many peaks we retained:

ncol(featureMatrix_filter)

## [1] 928

We now know where there are peaks, and what the intesity is in each sample. We don’t yet know whether there are any differences between samples. A lot of analysis can be done in R, but alternatively we can export the data as csv files to enable import in other applications which are more user friendly:

intensities_out<-data.frame(Sample=samples$tech_id,
                            Label=samples$soil,
                            featureMatrix_filter, check.names = F, stringsAsFactors = F)
intensities_out<-rbind(c(NA, "rt", rep(1, ncol(intensities_out)-2)), intensities_out) #Create fake RT
intensities_out_ex<-intensities_out %>% dplyr::filter(Sample %in% c("A3.3")) #Create fake RT
intensities_OF<-intensities_out %>% dplyr::filter(Label %in% c("O","F","m/z"))
write.csv(intensities_OF, row.names = F, file = "Intensities_OF.csv")
write.csv(intensities_out, row.names = F, file = "Intensities.csv")
write.csv(intensities_out_ex, row.names = F, file = "Intensities2.csv")

Extracting features

jhms

01/04/2020