ABSTRACT
Recordings involving cellular telephones or personal digital assistants ("PDAs") are increasingly the source evidence in audio forensic examinations, compared to recordings originating with other devices such as hand-held digital recorders. On modern PDA cellular telephones, recordings can be made either directly to the telephone or transmitted as voice mail messages. The current investigation focuses on differences between the two types of recordings in terms of dynamic range and linearity of levels. Such information can be important for characterizing the distance of sound sources relative to the microphone and for understanding the transformation of recorded speech and non-speech sounds.
1. INTRODUCTION
Recordings involving cellular telephones or personal digital assistants ("PDAs") are increasingly the focus of audio forensic examinations, while recordings originating with other devices such as hand-held digital recorders have decreased as these devices are supplanted by PDAs, much as has happened with consumer-level cameras. On modern PDA cellular telephones, recordings can be made either directly to the telephone or transmitted as voice mail messages to a particular messaging system. It is immediately apparent that the quality of such recordings varies widely between messaging systems and between built-in recording applications; furthermore, the sensitivity of recordings to different sound source levels varies greatly, due to codecs and/or telephonic transmission. An understanding of these transformational properties has important implications for forensic analyses that include speech transcription of "background" voices, gunshots, distance of sound sources, and environmental context analysis (sometimes referred to as "roomprints") [1, 2].
When made directly to the telephone, the recording function is similar to that of a traditional hand-held digital recorder. For example, the native voice memo recording application on the iPhone 5s writes .m4a files that, while technically written using a lossy codec, are wideband and have a decent dynamic range (sampling rate 44.1 kHz, bitrate 64 kbps). A particular application from a third party may allow alternative means of export (but not necessarily of recording). The use of the iPhone microphone as input to a sound level meter application has also been evaluated in the literature [3].
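To put the 64 kbps figure in context, the following minimal sketch compares it with the bitrate of uncompressed PCM at the same sampling rate. The 16-bit depth and single channel are assumptions chosen for illustration; they are not stated for the voice memo files described above.

```python
# Rough bitrate comparison for the voice memo format described above.
# BIT_DEPTH and CHANNELS are assumptions for illustration only.
SAMPLE_RATE_HZ = 44_100   # from the text
ENCODED_KBPS = 64.0       # from the text
BIT_DEPTH = 16            # assumed reference PCM bit depth
CHANNELS = 1              # assumed mono


def pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels):
    """Bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000.0


pcm_kbps = pcm_bitrate_kbps(SAMPLE_RATE_HZ, BIT_DEPTH, CHANNELS)
compression_ratio = pcm_kbps / ENCODED_KBPS

print(f"Uncompressed PCM: {pcm_kbps:.1f} kbps")
print(f"Approximate compression ratio: {compression_ratio:.1f}:1")
```

Under these assumptions the lossy codec discards roughly nine tenths of the raw bitrate while preserving the full 44.1 kHz sampling rate.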
When a recording is saved as a voice mail message, whether on another telephone, on a corporate voice mail server, or on an emergency ("911") call log recording system, several additional and varied stages of signal processing can be involved. These stages include algorithms designed to optimize the speech signal against background noise, including compression and voice activity detection algorithms. Most importantly, these stages of signal processing alter the spectrum of speech as a function of level.
The current investigation is necessarily limited in scope, focusing primarily on tests conducted in our laboratory with an iPhone 5s on an AT&T carrier (GSM 4G) in the San Francisco Bay Area. Analyses of actual gunshots from a different, older-model cellular telephone are provided in a final section (details have been omitted due to privacy concerns). We approach these effects from the standpoint of a "black box" analysis, in which we provide a detailed description of the input and the resulting output. The analyses focus on differences between types of recordings in terms of dynamic range and linearity of levels. Such information can be important for characterizing the distance of sound sources relative to the microphone and for understanding the impact of recorded speech and non-speech sounds in forensic settings.
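As an illustration of what a "black box" check of level linearity can look like, the sketch below measures the RMS level of a signal in dB relative to full scale and fits the slope of output level against input level. It is a sketch under stated assumptions, not the measurement procedure used in this study: the function names and the example level values are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB re full scale, for float samples in [-1.0, 1.0]."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def linearity_slope(input_db, output_db):
    """Least-squares slope of output level versus input level (dB per dB).

    A slope near 1.0 indicates a level-linear system; a slope below 1.0
    indicates compression of level differences.
    """
    n = len(input_db)
    mean_in = sum(input_db) / n
    mean_out = sum(output_db) / n
    cov = sum((x - mean_in) * (y - mean_out)
              for x, y in zip(input_db, output_db))
    var = sum((x - mean_in) ** 2 for x in input_db)
    return cov / var


def sine(amplitude, n=8000, freq_hz=1000.0, rate_hz=8000.0):
    """Test tone with a whole number of cycles, so the RMS is exact."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / rate_hz)
            for i in range(n)]


# Hypothetical measurements: the same stimulus played at four levels.
input_levels = [-40.0, -30.0, -20.0, -10.0]
linear_output = [-46.0, -36.0, -26.0, -16.0]      # fixed 6 dB loss
compressed_output = [-30.0, -25.0, -20.0, -15.0]  # 2:1 level compression

print(round(rms_dbfs(sine(1.0)), 2))
print(linearity_slope(input_levels, linear_output))
print(linearity_slope(input_levels, compressed_output))
```

In this toy example the first system preserves level differences (slope 1.0), while the second halves them (slope 0.5), which is the kind of behavior that complicates estimates of source distance from recorded level.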
An admitted but not fatal limitation of this study is a lack of certainty regarding the technical aspects of the hardware involved, and how the signal was transformed by individual signal processing elements in the communication chain of the cellular telephone system, including those features caused by linear predictive coding, discontinuous transmission, voice activity detection, and the inclusion of so-called "comfort noise" (see e.g. [4] for an informative overview). In other words, we observe what occurs to the signal and its implications for forensic analysis, without analyzing the specific causes.
The frequency response characteristics of the voice mail recordings (AMR file structure) indicate a classic narrowband, fixed-rate telephony channel with a cutoff at around 4 kHz, consistent with (but not necessarily indicating) a G.711 mu-law codec. There are three microphones in the iPhone: one at the bottom, one next to the speaker above the screen, and one between the flash and the lens of the camera on the rear of the phone. Investigation indicates that different applications may access different microphones to optimize their functionality. For example, when the front-facing camera is used for recording video, the front-facing microphone is active. This microphone is deactivated when the camera is switched to rear-facing, and the rear-facing microphone becomes active. The voice memo application appears to use all three microphones to varying degrees, with the bottom microphone being the main source. When making voice calls, only the bottom microphone actively transmits speech; however, the iPhone features a noise cancellation control system, and it is believed that at least one additional microphone is used for noise cancellation during voice calls when this feature is enabled.
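For readers unfamiliar with the mu-law codec mentioned at the start of this paragraph, the following minimal sketch shows the continuous mu-law companding curve (mu = 255) and why it treats quiet and loud sounds differently. It is illustrative only: G.711 itself uses a piecewise-linear approximation of this curve, the 8-bit quantizer here is a simplification, and nothing in this sketch is claimed to represent the processing actually applied by the system under test.

```python
import math

MU = 255.0  # companding parameter used by mu-law telephony


def mu_law_compress(x):
    """Continuous mu-law curve for x in [-1.0, 1.0]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress for y in [-1.0, 1.0]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(value, levels=127):
    """Simplified uniform quantizer (assumption: symmetric, 8-bit-like)."""
    return round(value * levels) / levels


def companded_round_trip(x):
    """Compress, quantize in the compressed domain, then expand."""
    return mu_law_expand(quantize(mu_law_compress(x)))


quiet = 0.01  # a low-level sample, e.g. a distant sound source
error_companded = abs(companded_round_trip(quiet) - quiet)
error_uniform = abs(quantize(quiet) - quiet)

print(f"companded error: {error_companded:.6f}")
print(f"uniform error:   {error_uniform:.6f}")
```

Because quantization steps are finer near zero and coarser near full scale, low-level sounds are represented with relatively less error than they would be under uniform quantization at the same word length, while high-level sounds are represented more coarsely.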