
Table of Contents

  1. Overview

  2. Getting Started

    1. Background Reading on Event Detection
  3. Manipulating Audio Data In Python

    1. Turning Wav files into Samples
    2. Breaking Samples into Windows
    3. Peak Frequency Detection
    4. RMS/dB Analysis
    5. Example Car Detection
    6. Expanding RMS Analysis
  4. Recording and Analyzing Live Data

  5. Building an Audio Classifier

    1. Labeling the Data
    2. Feature Extraction
    3. Training the Classifier

Overview

This wiki page goes over various methods of detecting audio events and provides links to helpful references.

The primary event we were concerned with when writing this page was train detection, but with the right detection parameters any event that generates a distinctive audio signature can be found.

Getting Started

To begin finding the correct parameters for an audio event, you will need to install a recording/analysis program. I am currently using Audacity. The program is useful on its own, and it also supports plugins written in the Nyquist language. Two plugins that will be very useful are Wave Stats and ACX Check. Download these two plugins and follow the instructions on how to install them. After putting them in your plugins folder, go to Effect->Add/Remove Plug-ins to enable them; they should then appear under the Analyze menu.

The Python code we are using with the microphone records audio in 16-bit mono. You can have Audacity record in the same format, or convert the recording later. It is useful to be familiar with the Audacity shortcuts, especially CTRL-D (Duplicate), as it allows you to highlight a useful part of a larger file and extract just that part. To move audio from an Audacity project into a .wav file for analysis, go to File->Export Audio.

After you have recorded samples of an event you are trying to detect, you can do some preliminary analysis with the plugins that you just downloaded. Highlight a section of audio and go to Analyze->ACX Check. This will return something similar to this image:

This can be helpful when attempting to verify that your python code is returning correct dB and RMS values.

Background Reading on Event Detection

Before you proceed to the next sections, you should be somewhat familiar with the content in the following links:

http://samcarcagno.altervista.org/blog/basic-sound-processing-python/ - Basics of working with audio in python

https://pdfs.semanticscholar.org/cd34/eaa0029ba9f6cfc4a1ed64722e05a476079f.pdf - What the train detection code is based on

https://stackoverflow.com/questions/4160175/detect-tap-with-pyaudio-from-live-mic - Related Work

https://en.wikipedia.org/wiki/Decibel

Additionally, you can read some of the following for more information:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610 - A reference for where future work should head

http://greenteapress.com/thinkdsp/thinkdsp.pdf - A reference

Listening to Caltrain: Analyzing Train Whistles with Data Science

Acoustic Event Detection Using Machine Learning: Identifying Train Events

Manipulating Audio Data in Python

Turning Wav files into Samples

After you have some audio data and have done some preliminary analysis in Audacity, you can move on to more advanced analysis with Python. The first step is to convert the wav file into a format you can manipulate.

from scipy.io import wavfile

def getRecordedFrames(fileName, Channels=1):
    sampFreq, frames = wavfile.read(fileName)
    lenData = float(len(frames))

    # Turn a stereo recording into mono by keeping only the first channel
    if Channels > 1:
        frames = frames[:, 0]

    ## Uncomment the next line if you want int16 data to be converted to floating point values between -1 and 1
    #frames = frames / (2.0**15)

    print('data type: ', frames.dtype)        # int16
    print('shape: ', frames.shape)            # e.g. (40000,) for 40k sample points
    print('duration: ', (lenData / sampFreq)) # duration of the recording in seconds
    print('lenData: ', lenData)               # number of samples

    return (frames, sampFreq)

Breaking Samples into Windows

Now you can decide on the size of the frames that you would like to look at. Just to get dB values, you might consider frame sizes of 0.125 seconds. The paper linked above used a frame size of 5 seconds to detect events. We will be implementing their volume-based event detection (without overlapping windows) in a series of steps below.

To begin with, we need to break up the audio file into smaller bits.

import math

frames, sampFreq = getRecordedFrames('C:\\Desktop\\6-9-17.wav')
lenData = float(len(frames))                            # Number of samples
howManyChunks = math.floor((lenData/sampFreq)/5.0)      # (lenData/sampFreq) = duration of recording; (duration/window size) = number of windows
whereToBreak = int(lenData/howManyChunks)               # Size of each window in samples

longtermAvgVol = getDb(frames)                          # Average dB value of the entire recording (getDb is defined below)
longtermAvgRms = getRms(frames)                         # Average RMS value of the entire recording (getRms is defined below)

for i in range(int(howManyChunks)):
    frames_chunk = frames[i*whereToBreak:whereToBreak*(i+1)]
    # per-window analysis goes here (see the RMS/dB section below)

The above code will segment a wav file into chunks of 5 seconds each. This will now allow us to look at the parameters we care about. In this case, we will be looking at the volume in two different ways: RMS and decibels.

Peak Frequency Detection

There are a variety of ways that we can analyze audio data. One method is to look at the signal in the frequency domain. In this case we will be analyzing the peak frequency of a signal over windowed segments. Some events can be directly associated with a specific frequency or frequency range, such as a whistle blowing. If this frequency is unique in the environment that the node is in, it will be an indicator of the event. To perform this analysis, we use the following function:

import numpy as np
from scipy.fftpack import fft, fftfreq

def getPeak(frames, rate=44100):
    # Find the frequency bin with the largest FFT coefficient
    signalFFT = fft(frames)
    freqs = fftfreq(len(signalFFT))          # frequencies in cycles per sample
    idx = np.argmax(np.abs(signalFFT))
    freq = freqs[idx]
    peakFreqInHertz = abs(freq * rate)       # convert to Hz using the sample rate
    return peakFreqInHertz
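
As a rough sketch of how this might plug into the windowing code above (assuming the frames, sampFreq, howManyChunks, and whereToBreak variables from that snippet), the peak frequency of each window could be printed like this:

for i in range(int(howManyChunks)):
    frames_chunk = frames[i*whereToBreak:whereToBreak*(i+1)]
    # Peak frequency of this window, using the recording's actual sample rate
    peak = getPeak(frames_chunk, rate=sampFreq)
    print('Window %d peak frequency: %.1f Hz' % (i, peak))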

RMS/dB Analysis

In the above code sample, when we called getDb and getRms, we were finding the average decibel and RMS values over the entire length of the audio. Note that these are the same quantity expressed in two different ways: the dB value is just the RMS on a logarithmic scale. If we had a 10 minute audio file, these would be the average values over all 10 minutes. The functions to find the decibels and RMS are:

import math
import numpy as np

def getDb(frames):
    # Returns the RMS expressed in dB
    frames_copy = frames.astype(np.float64)
    dat = frames_copy / 32768.0                 # convert int16 to floats between -1 and 1
    magSq = np.sum(dat ** 2.0) / len(dat)       # mean of the squared signal
    dB = 10.0 * math.log(magSq, 10.0)
    return dB

def getRms(frames):
    # Returns the normalized RMS value
    frames = frames / (2.0 ** 15)               # convert int16 to floats between -1 and 1
    return np.sqrt(np.mean(np.square(frames)))

longtermAvgVol = getDb(frames)                  # Average dB value of the entire recording
longtermAvgRms = getRms(frames)                 # Average RMS value of the entire recording
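
Because getDb is just the normalized RMS expressed on a logarithmic scale, the two functions can be used to sanity-check each other (a small check, assuming the functions and the frames variable above):

rms = getRms(frames)
dB = getDb(frames)
# 20*log10(RMS) of the normalized signal should match getDb (within floating point error)
print(dB, 20.0 * math.log(rms, 10.0))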

Once we have the average values for the entire recording, we can compare these to the values over a much smaller segment of time. To do this, we find the dB and RMS values over a 5 second window. We can then compare these window values to the values over the entire recording. In the case of a train, the idea is that there will be a long time where there is no train, which should have a lower energy/volume level. When a train approaches the node, it will bring a significant increase in volume that will be detected by the microphone. In the for loop we will get the values for the windowed period. If they are much greater (2 times) than the average levels from the entire recording, an event has been detected.

for i in range(int(howManyChunks)):
    frames_chunk = frames[i*whereToBreak:whereToBreak*(i+1)]
    decibel = getDb(frames_chunk)
    rms = getRms(frames_chunk)
    if rms > (2 * longtermAvgRms):
        print('Audio Event Detected')

Example Car Detection

During the summer of 2017 we went to Decatur multiple times to gather audio data. Most of this audio was just ambient noise while waiting for a train. However, due to our proximity to the road, it provided reasonable data for car audio analysis. Here is an overview of a 15 minute segment that we gathered on 6/9/17:

Most of the spikes that you see are cars passing by, with around 4 of the events being MARTA trains. To check the validity of the audio event detection method we are using, I used a small segment of this data that contained 7 cars to see if it detected them. Here is an overview of this segment.

You can see that there are around 7 dominant peaks. Just eyeballing the data, it appears that an average car event lasts around one second. Therefore, I changed the window size in the code to 1 second and tried different values for the RMS multiplier, with 1.75x detecting 7 audio events. Here is where I heard the events (circles) versus where the code recognized an event above the threshold (shaded squares), along with the graph of the detected events and the output of the program:

This is just a preliminary analysis. Running the same code over a different chunk containing 7 cars only results in 4 cars being detected. Instead of looking only at fixed segments in time, we should apply an overlapping window to the data to allow for events that straddle segment boundaries. It will also require additional tweaking of the parameters.

Expanding RMS Analysis

We saw in the previous example that the code was detecting single events as two separate events; this is a product of not taking temporal locality into account. We also wanted to see how applying an overlapping window would change the outcome of the detection. To achieve these goals, we have modified the existing RMS detection code to implement these features.

For testing purposes we have implemented two different functions. The first function implements the RMS detection approach from above, but links events that are within one window of each other. We are not using a Hamming window, nor are we doing overlapping analysis of the frames. The second function is the same as the first, with the exception of overlapping the windows by 50%. Here is the code for the second:

import math
import logging
import numpy as np
import matplotlib.pyplot as plt

def rmsAnalysisOverlapping():
    '''
    Basic testing for volume event detection - Overlapping windows by 50%, events temporally linked
    '''
    MULTIPLIER = 1.75
    WINDOW_SIZE = .5
    eventCount = 0

    #First file provides the RMS reference, second file is the one analyzed for events
    backFrames, backSamp = getRecordedFrames('C:\\Users\\Gersemi\\Desktop\\sound_files\\6-9-17.wav')
    frames, sampFreq = getRecordedFrames('C:\\Users\\Gersemi\\Desktop\\sound_files\\6-9-17.wav')

    #Information on the audio file to analyze
    lenData = float(len(frames))                                    #Samples
    howManyChunks = math.floor((lenData/sampFreq)/WINDOW_SIZE)      #(lenData/sampFreq)=Duration of recording. (Duration/WindowSize)=# of windows
    whereToBreak = int(lenData/howManyChunks)                       #Size of Windows in samples

    #RMS threshold
    longtermAvgVol = getDb(backFrames)
    longtermAvgRms = getRms(backFrames)

    #Plot the horizontal line for RMS threshold and the signal in time
    f,ax = plt.subplots(2,figsize=(12,10), dpi=100, sharex=True)
    timeArray = np.linspace(0, (lenData / sampFreq), int(lenData))
    ax[0].plot(timeArray, frames)
    ax[0].set_title('Audio Signal')
    ax[0].set_ylabel('Amplitude')
    timeArray = np.linspace(0, (lenData/ sampFreq))
    horizLine = np.array([MULTIPLIER*longtermAvgRms for i in range(len(timeArray))])   
    plt.plot(timeArray, horizLine, label='Detection Level: '+str(MULTIPLIER)+'x')

    #Window overlap - /2 == 50%
    seg_length = whereToBreak
    i, j = 0, seg_length
    step = seg_length / 2

    #To connect events that are temporally adjacent 
    oldi=0
    oldj=0

    while j <= (howManyChunks*whereToBreak):
        framesChunk = frames[i: j]
        t = ((i/sampFreq) + (j/sampFreq)) / 2

        rms = getRms(framesChunk)
        plt.plot(t, rms, marker='.', color='k')

        if rms>(MULTIPLIER*longtermAvgRms):
            if i<oldj:
                #events are the same if within a window size of each other - connect with red line and draw red dot
                plt.plot(t, rms, marker='o', color='r')
                a=[[t, rms],[oldx,oldy]]
                plt.plot(*zip(*a), marker='o', linestyle='-',color='r')
            else:
                #Draw new red point for event
                eventCount += 1
                print('\n')
                logging.info('Event Detected ')
                logging.info('i time: "%s", j time: "%s"' % ((i/sampFreq),(j/sampFreq)))
                logging.info('rms: "%s", MULTIPLIER * longtermRMS: "%s"\n' % (rms, abs(MULTIPLIER*longtermAvgRms)))
                plt.plot(t, rms, marker='o', color='r')
                oldx=t
                oldy=rms
            oldi=i
            oldj=j
            
        i += int(step)
        j += int(step)

    plt.xlabel('Time')
    plt.ylabel('Volume - RMS')
    plt.title('Event Detection, Window Size: "%s", Events Detected: "%s"' % (WINDOW_SIZE, eventCount))
    plt.legend(loc='best')  
    plt.savefig('volumeWindowed.png', bbox_inches='tight')
    logging.info('eventCount: "%s"' % eventCount)

Here are the results of the changes:

As you can see, we are still only detecting 4 of the 7 car events with these parameters, or otherwise stated, we have an accuracy of ~57%. However, we are now correctly identifying unique events, whereas before we were not.

Using the volume analysis techniques over the entire audio file from 6-9-17, the number of events detected is 67 or 69, depending on whether the first or second function is used. Here are the graphs of the results.

By going back and labeling the recording data (read more in the audio classifier section), we determined that there are ~117 car events in the entire clip, making our best detection function ~59% accurate.

Recording and Analyzing Live Data

The next step is to get data live from the microphone and perform analysis on it. In Python, this requires the use of PyAudio. The basic template to get data from the microphone is as follows:

import numpy as np
import pyaudio

def getAudio():
    CHUNK = 512  # Changing the recording seconds might require a change to this value
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    RECORD_SECONDS = 5  
    CHUNKS_TO_RECORD = int(RATE / CHUNK * RECORD_SECONDS)

    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    print("* Recording audio...")

    frames = np.empty((CHUNKS_TO_RECORD * CHUNK), dtype="int16")

    for i in range(0, CHUNKS_TO_RECORD):
        audioString = stream.read(CHUNK)
        frames[i * CHUNK:(i + 1) * CHUNK] = np.fromstring(audioString, dtype="int16")

    print(frames)
    print("* done recording\n")

    print("closing stream")
    stream.stop_stream()
    stream.close()
    p.terminate()
    return frames
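
To tie live recording back to the volume analysis above, a minimal detection loop might look like the following sketch. It assumes getAudio and getRms from earlier and that longtermAvgRms was computed beforehand from a representative background recording; the 2x multiplier is just the threshold used in the earlier examples.

MULTIPLIER = 2.0

while True:
    frames = getAudio()                      # record a 5 second window from the microphone
    rms = getRms(frames)
    if rms > MULTIPLIER * longtermAvgRms:    # compare against the long-term background level
        print('Audio Event Detected, RMS: "%s"' % rms)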

Building an Audio Classifier

In the previous section, our best results were only accurate ~59% of the time. A next step in making a more effective audio detection system is to train a machine learning classifier to build a model of a car's audio characteristics. There are various groups of general Machine Learning problems such as classification, regression, and clustering. We are interested in classification algorithms, as we wish to distinguish one class (cars) from any other class of audio data. Within classification, there are multiple algorithms that one can use such as logistic regression, Support Vector Machines, K Nearest Neighbors, or Decision Trees, among others. Understanding each particular algorithm would help, but is not as critical as understanding the process in which we use them.

Each algorithm (in the sklearn library) takes in a numpy array of samples with their respective features, X, of size (n_samples, n_features), and a numpy array of class labels, y, of size (n_samples). Our goal is to extract samples of class "car" and samples of class "not car" from the audio data we have recorded. We will then extract the features of this audio that will (hopefully) allow the classifier to distinguish between the two classes. We then train the algorithm(s) on the data we have extracted using that algorithm's .fit function. After the algorithm fits a model to the data, it should be able to take in new data and decide whether it belongs to the class "car" or "not car". Read the following tutorial for a brief overview of machine learning.
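
As a toy illustration of that X/y interface (made-up numbers, not our actual feature data), fitting and using a classifier in scikit-learn looks like this:

import numpy as np
from sklearn.svm import SVC

# 4 made-up samples with 2 features each, and their class labels
X = np.array([[0.10, 220.0],
              [0.90, 180.0],
              [0.15, 240.0],
              [0.85, 200.0]])
y = np.array(['junk', 'car', 'junk', 'car'])

clf = SVC(kernel='linear')
clf.fit(X, y)                        # build a model from the labeled samples
print(clf.predict([[0.8, 190.0]]))   # classify a new, unseen sample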

This wiki page just goes over the details pertaining to the specific classifier that we have implemented. This leaves a lot of background information out of the picture that will be necessary for understanding the process. Here are some resources that are the basis for understanding the approach we took:

http://scikit-learn.org/stable/documentation.html - The Machine Learning library we are using

http://airccse.org/journal/ijsc/papers/4213ijsc03.pdf - Paper detailing the use of an audio classifier to detect and label types of cars

http://ijssst.info/Vol-15/No-5/data/5198a100.pdf - Paper detailing the most effective features for classification

http://www.vikparuchuri.com/blog/analyzing-audio-to-figure-out-which-simpsons-character-is-speaking/ - Using ML to determine a speaker from audio data (Used code from this for feature extraction)

http://www.speech.zone/exercises/digit-recogniser/speaker-dependent-system/collect-and-label-the-data/ - Site detailing how to label audio data correctly

http://aqibsaeed.github.io/2016-09-03-urban-sound-classification-part-1/ - Another ML example using tensor flow and librosa

Labeling the Data

In order to train a classifier, we needed labelled data. Therefore, we went back to the audio data that was captured earlier in the summer. We used WaveSurfer to label the data following this guide. This produced a file of labels; the class names were either 'car' or 'junk'. The audio segment can then be split on these label boundaries. Unfortunately, this produces a group of files with numbers in front of the class label. Using a Python script, we renamed these to just the class name plus an instance number. Here is the code to do so:

import os

filePath = 'C:\\Users\\...'
classifyPath = 'C:\\Users\\...'

def renameFiles():

    carC = 0
    junkC = 0
    martaC = 0

    os.chdir(filePath)

    # Create a folder for each class if it does not already exist
    if not os.path.exists(filePath+'\\car'):
        os.mkdir(filePath+'\\car')
    if not os.path.exists(filePath+'\\junk'):
        os.mkdir(filePath+'\\junk')
    if not os.path.exists(filePath+'\\marta'):
        os.mkdir(filePath+'\\marta')
    for filename in os.listdir(filePath):
        
        print('Renaming: "%s"' % filename)
        if filename[-7:] == 'car.wav':
            newName = 'car'+str(carC)+'.wav'
            os.rename(filename, 'car\\'+newName)
            carC += 1
        elif filename[-8:] == 'junk.wav':
            newName = 'junk'+str(junkC)+'.wav'
            os.rename(filename, 'junk\\'+newName)
            junkC += 1
        elif filename[-9:] == 'marta.wav':
            newName = 'marta'+str(martaC)+'.wav'
            os.rename(filename, 'marta\\'+newName)
            martaC += 1

We now have a set of data that is n_samples (204 in this case) long. The next step is to extract feature vectors for each of the audio samples.

Feature Extraction

Feature extraction is the most confusing part of training a classifier. This is especially true in the case of audio data. It is unlike images or other types of data, where the actual samples make up (or are part of) the feature vector. In the digits tutorial linked above, the actual pixels that made up the digits were the features that the model was fit on. However, in our case of audio data, there is the additional step of extracting the characteristics. There can be a multitude of potential features for any given segment. We could choose to look at the RMS of a segment as a feature, or the peak frequency of the segment (as shown above). This would give us a feature array of size (204, 2). Using just the RMS of each sample was attempted with this data set; however, it was not effective in training a classifier. Finding the optimal feature set for any given problem is a huge area of research in machine learning, and some work has been done to help us out. Here is a paper on optimal features for audio data, and here is an example of feature extraction being used for an audio classifier. In the second example, the author calculates a feature vector of 23 different attributes for any given sample. Since it needs just a bit of modification, we will use it to extract features from our audio. Note that we should go back in the future to find the most relevant features to save computation time.
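
For example, the simple two-feature version mentioned above (the RMS and peak frequency of each labeled clip) might look like the sketch below, assuming the getRecordedFrames, getRms, and getPeak functions from earlier and the renamed wav files under classifyPath. The full, adapted 23-attribute feature extraction that we actually use follows after it.

import os

def simpleFeatures(className):
    # Sketch: builds an (n_samples, 2) feature list of [RMS, peak frequency]
    X = []
    y = []
    for fileName in os.listdir(classifyPath + className):
        frames, fs = getRecordedFrames(classifyPath + className + '\\' + fileName)
        X.append([getRms(frames), getPeak(frames, rate=fs)])
        y.append(className)
    return X, y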

import os
import pprint
import numpy as np

def determineFeatures(className):
    dataFile = className+'Features.py'
    X = []
    y = []
    try:
        for fileName in os.listdir(classifyPath+className):
            # mic is the module containing the getRecordedFrames function from earlier
            frames, fs = mic.getRecordedFrames(classifyPath+className+"\\"+fileName)
            feature = calc_features(frames, fs)

            print('Working on file: "%s"' % fileName)
            X.append(feature)
            y.append(className)

        # Append the feature vectors and labels to an importable Python file
        fileObj = open(dataFile, 'a')
        fileObj.write('X = ' + pprint.pformat(X) + '\n')
        fileObj.write('y = ' + pprint.pformat(y) + '\n')
        fileObj.close()

    except Exception as e:
        print('The data broke it: "%s"' % e)

def calc_slope(x,y):
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    x_dev = np.sum(np.abs(np.subtract(x,x_mean)))
    y_dev = np.sum(np.abs(np.subtract(y,y_mean)))

    slope = (x_dev*y_dev)/(x_dev*x_dev)
    return slope

def get_indicators(vec):
    mean = np.mean(vec)
    slope = calc_slope(np.arange(len(vec)),vec)
    std = np.std(vec)
    return mean, slope, std

def calc_u(vec):
    fft = np.fft.fft(vec)
    return np.sum(np.multiply(fft,vec))/np.sum(vec)

def calc_features(vec,freq):
    #bin count
    bc = 10
    bincount = list(range(bc))
    #framesize
    fsize = 512
    #mean
    m = np.mean(vec)
    #spectral flux
    sf = np.mean(vec-np.roll(vec,fsize))
    mx = np.max(vec)
    mi = np.min(vec)
    sdev = np.std(vec)
    binwidth = int(len(vec)/bc)
    bins = []
    for i in range(0,bc):
        bins.append(vec[(i*binwidth):(binwidth*i + binwidth)])
    peaks = [np.max(i) for i in bins]
    mins = [np.min(i) for i in bins]
    amin,smin,stmin = get_indicators(mins)
    apeak, speak, stpeak = get_indicators(peaks)
    #fft = np.fft.fft(vec)
    bin_fft = []
    for i in range(0,bc):
        bin_fft.append(np.fft.fft(vec[(i*binwidth):(binwidth*i + binwidth)]))

    cepstrums = [np.fft.ifft(np.log(np.abs(i))) for i in bin_fft]
    inter = [get_indicators(i) for i in cepstrums]
    acep,scep, stcep = get_indicators([i[0] for i in inter])
    aacep,sscep, stsscep = get_indicators([i[1] for i in inter])

    zero_crossings = np.where(np.diff(np.sign(vec)))[0]
    zcc = len(zero_crossings)
    zccn = zcc/freq

    u = [calc_u(i) for i in bins]
    spread = np.sqrt(u[-1] - u[0]**2)
    skewness = (u[0]**3 - 3*u[0]*u[5] + u[-1])/spread**3

    #Spectral slope
    #ss = calc_slope(np.arange(len(fft)),fft)
    avss = [calc_slope(np.arange(len(i)),i) for i in bin_fft]
    savss = calc_slope(bincount,avss)
    mavss = np.mean(avss)

    return [m,sf,mx,mi,sdev,amin,smin,stmin,apeak,speak,stpeak,acep,scep,stcep,aacep,sscep,stsscep,zcc,zccn,spread,skewness,savss,mavss]

Call the function determineFeatures with the class/folder name and let it run for quite some time. After it is done, you will have two Python files: carFeatures.py and junkFeatures.py. Combine the X and y arrays and save them into one file called carAndJunkFeatures.py, or download it here. If saving variable information to a Python file is new to you, read this for reference.
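
One way to do the combination (a sketch, assuming carFeatures.py and junkFeatures.py were each written once by determineFeatures as shown above) is:

import pprint
import carFeatures
import junkFeatures

# Concatenate the feature vectors and labels from both classes and write
# them out as a single importable Python file
X = carFeatures.X + junkFeatures.X
y = carFeatures.y + junkFeatures.y

with open('carAndJunkFeatures.py', 'w') as f:
    f.write('X = ' + pprint.pformat(X) + '\n')
    f.write('y = ' + pprint.pformat(y) + '\n')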

Training the Classifier

Now it is time to get to the actual machine learning portion. The algorithm that we will be using for the first attempt is a linear SVM. First, import the ML algorithm and a tool to split the feature data into a test set and a training set. This split will be used to verify how well the classifier generalizes to unseen data.

from sklearn.svm import SVC
# Note: in newer versions of scikit-learn, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

Now load the feature data that we had previously extracted, split it into their respective sets, and initialize the classifier.

import numpy as np
import carAndJunkFeatures

X = np.array(carAndJunkFeatures.X)
y = np.array(carAndJunkFeatures.y)

testSize = .3
randomState = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=testSize, random_state=randomState)

clf0 = SVC(kernel='linear')

You can read about the different SVM kernels to get an idea of the other options out there. For this case we will be using the linear kernel, which works best when the two classes are (approximately) linearly separable. Now we need to have the algorithm fit a model to the data:

clf0.fit(X_train,y_train)

This gives the algorithm the feature data along with the corresponding correct labels. This is what allows it to build the model, by fitting the characteristics of the feature data to the correctly labeled classes. This particular algorithm can take in classes that have strings as their names, but note that for some algorithms you have to convert the class labels to ints. After the model has been constructed, we can test its accuracy by giving it data without the class labels and comparing its predictions to the actual classes.

import pprint

pprint.pprint('linear SVM')
predictions = clf0.predict(X_test)
pprint.pprint('Predictions:'), pprint.pprint(np.array(predictions))
pprint.pprint('Ground Truth:'), pprint.pprint(np.array(y_test))

predVsTruth = predictions == y_test
pprint.pprint(predVsTruth)
numCases = len(predictions)
numTrue = np.sum(predVsTruth)
numFalse = numCases - numTrue
print('Accuracy is: "%s"' % (numTrue/numCases*100))
print('Number True: "%s", Number False: "%s"\n\n' % (numTrue, numFalse))

This is the output of y_test on the classifier:

We can now try the same system using different algorithms. Here are the following results:

Note that with a different kernel, the algorithm was not able to separate the boundaries of the two classes.

This is using a weighted sum of the probabilities of the three previous outputs to determine the class.
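
A combination like this can be sketched in scikit-learn by averaging the per-class probabilities of several trained classifiers. The three classifiers and weights below are only placeholders for whichever models were actually compared; the sketch assumes the X_train/X_test split from the training section.

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder classifiers and weights - substitute the models that were actually trained
classifiers = [SVC(kernel='linear', probability=True),
               KNeighborsClassifier(),
               DecisionTreeClassifier()]
weights = [0.5, 0.25, 0.25]

for clf in classifiers:
    clf.fit(X_train, y_train)

# Weighted sum of the predicted class probabilities, then pick the most likely class
probs = sum(w * clf.predict_proba(X_test) for w, clf in zip(weights, classifiers))
combined = classifiers[0].classes_[np.argmax(probs, axis=1)]
print('Combined accuracy: "%s"' % (np.mean(combined == y_test) * 100))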