TOEIC Link Microphone and Audio Environment Monitoring: How the Proctor Listens to Your Room

The microphone on a TOEIC Link session is doing two jobs at once. During the Speaking module it is the input device that captures your responses and sends them to the scoring engine. Across the entire session — including modules where you are not speaking — it is also a sensor that the proctoring agent uses to confirm that you are alone in the room, that no second voice is coaching you, and that no audio cues from a phone or another device are reaching you mid-test. Most test-takers know the microphone is on during Speaking. Far fewer know it is on during Listening, Reading, and Writing.

This guide explains the four audio signals the proctoring agent monitors, which ambient sounds are normal versus suspicious, and how to set up a room that produces a clean session log without expensive equipment.

What the proctoring agent monitors via the microphone

The agent collects four signals from the microphone during every TOEIC Link session: voice activity, second-voice detection, ambient-sound classification, and audio-cue detection. Each one runs continuously throughout the session, regardless of which module the test-taker is currently in.

Voice activity is the foundation. The agent runs a voice-activity detector (VAD) on the incoming microphone signal to identify segments where human speech is present. During the Speaking module, voice activity is expected and is matched against the prompt timing — the agent confirms that the test-taker spoke when expected and stayed silent when expected. During Listening, Reading, and Writing, voice activity is unexpected, and any sustained vocal segment longer than a few seconds is logged for review. A test-taker reading a passage aloud to themselves during Reading is a normal but logged event; a sustained conversation in the test-taker's voice during Writing is a flag.
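The logging rule above can be sketched roughly as follows. This is a minimal illustration, assuming a simple energy-threshold detector over fixed-length frames; the guide does not describe the actual detector, and `FRAME_SECONDS`, `ENERGY_THRESHOLD`, and `MIN_SUSTAINED_SECONDS` are hypothetical values. An energy threshold reacts to any loud sound, not only speech, so a real VAD would be more selective.

```python
from typing import List, Tuple

FRAME_SECONDS = 0.5          # hypothetical frame length
ENERGY_THRESHOLD = 0.01      # hypothetical mean-square energy treated as "active"
MIN_SUSTAINED_SECONDS = 3.0  # hypothetical stand-in for "a few seconds"


def frame_energy(samples: List[float]) -> float:
    """Mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def sustained_voice_segments(frames: List[List[float]]) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for runs of active frames that last long enough."""
    segments = []
    run_start = None
    # The trailing empty frame closes any run still open at the end.
    for i, frame in enumerate(frames + [[]]):
        active = frame_energy(frame) >= ENERGY_THRESHOLD
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            start_s, end_s = run_start * FRAME_SECONDS, i * FRAME_SECONDS
            if end_s - start_s >= MIN_SUSTAINED_SECONDS:
                segments.append((start_s, end_s))
            run_start = None
    return segments
```

In this sketch a four-second run of active frames is returned for logging, while a one-second run is ignored.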

Second-voice detection runs on top of voice activity. The agent does not perform full speaker identification: it does not need to verify who the test-taker is, only to notice that a different voice is present. It compares the spectral characteristics of the active voice against a baseline established during the registration step (when the test-taker reads a short consent statement aloud) and flags segments where a different speaker's characteristics are detected. The flag is conservative: a brief overlap from someone in the next room is usually distinguishable from a coaching attempt because the second voice is fainter and is not synchronized with the test-taker's responses. A clear, full-volume second voice that interleaves with the test-taker's responses is the canonical coaching signature.
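One way to picture the baseline comparison is a similarity check between two spectral feature vectors. The sketch below assumes each voice segment has already been reduced to a short vector of band energies; the feature choice, the cosine metric, and `SIMILARITY_FLOOR` are all assumptions for illustration, not the agent's documented method.

```python
import math
from typing import Sequence

SIMILARITY_FLOOR = 0.85  # hypothetical: below this, the segment differs from the baseline


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_second_voice(baseline: Sequence[float], segment: Sequence[float]) -> bool:
    """True when the segment's spectral profile is unlike the registration baseline."""
    return cosine_similarity(baseline, segment) < SIMILARITY_FLOOR
```

A segment close to the baseline profile passes, and a segment whose energy sits in different bands is treated as a possible second speaker. The volume and synchronization checks described above would sit on top of this.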

Ambient-sound classification handles non-speech audio. The agent runs a small classifier on the incoming signal to bucket ambient sounds into categories: traffic noise, HVAC hum, keyboard typing, background music, television, phone notifications, dog barking, baby crying, doorbell, and a handful of others. Most of these are recorded but not flagged — a quiet room is rarely available, and the proctoring stack accepts that. What gets flagged are categories that suggest an active distraction or a hidden audio source: a television in the room, a phone notification chime, music with intelligible lyrics, or a doorbell that the test-taker leaves the room to answer.
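The record-versus-flag split in the paragraph above amounts to a small lookup. The sketch below is illustrative only: the category labels and the `Handling` names are hypothetical, and the classifier that produces the label is assumed to exist elsewhere.

```python
from enum import Enum


class Handling(Enum):
    RECORD_ONLY = "record_only"
    FLAG = "flag"


# Hypothetical labels for the flagged categories named in the paragraph above.
FLAGGED_CATEGORIES = {
    "television",
    "phone_notification",
    "music_with_lyrics",
    "doorbell_left_room",
}


def handling_for(category: str) -> Handling:
    """Everything is recorded; only the listed categories are also flagged."""
    return Handling.FLAG if category in FLAGGED_CATEGORIES else Handling.RECORD_ONLY
```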

Audio-cue detection is the most specialized signal. The agent listens for short, repeating audio patterns that could be a coaching system delivering hints: beeps, clicks, or short tonal sequences that recur in a pattern over time. The detector is tuned to be conservative because phone notifications, smart-home device chimes, and even some household appliances produce repeating tonal sounds. A flag from this detector is rare and almost always triggers human review rather than an automatic action.
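A simplified view of "repeating" is that the gaps between tonal events are nearly constant. The sketch below assumes the onset times of tonal events have already been extracted, and it covers only evenly spaced cues; `MIN_EVENTS` and `MAX_INTERVAL_CV` are hypothetical tuning values, not documented ones.

```python
from statistics import mean, pstdev
from typing import List

MIN_EVENTS = 4         # hypothetical: fewer events than this is not a pattern
MAX_INTERVAL_CV = 0.1  # hypothetical cap on the coefficient of variation of the gaps


def looks_like_repeating_cue(onsets_s: List[float]) -> bool:
    """True when tonal onsets recur at near-constant intervals."""
    if len(onsets_s) < MIN_EVENTS:
        return False
    ordered = sorted(onsets_s)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg <= 0:
        return False
    # Low spread relative to the mean gap means the events are evenly spaced.
    return pstdev(gaps) / avg <= MAX_INTERVAL_CV
```

Requiring several events and a tight spread is one way to keep such a detector conservative: a handful of irregular notification chimes does not qualify.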

Which ambient sounds are normal versus suspicious

Audio activity during the test session falls into four buckets, and the agent treats them very differently.

Normal: low-level ambient noise. Traffic outside, HVAC hum, the test-taker's own keyboard and mouse, fan noise from the laptop, the test-taker shifting in their chair, and similar low-amplitude sounds are normal. The agent records them but does not flag them. A perfectly silent room is almost impossible outside a recording studio, and the proctoring stack does not require one.

Normal: brief interruptions. A dog barking once in another room, a doorbell ringing, a delivery driver shouting, a child speaking briefly nearby, or a single notification chime that the test-taker forgot to silence is recorded and noted but not automatically flagged. The agent's heuristic is that brief, isolated audio events are part of normal home environments and do not affect test integrity. The exception is a brief event followed by a sustained audio sequence (a doorbell followed by a conversation, for example), at which point the sustained sequence triggers a review.
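The exception at the end of this paragraph can be sketched as a check over time-ordered events. This is a rough illustration under stated assumptions: `AudioEvent`, `BRIEF_MAX_S`, and `FOLLOW_WINDOW_S` are hypothetical names and values, since the guide gives no exact durations.

```python
from dataclasses import dataclass
from typing import List

BRIEF_MAX_S = 3.0       # hypothetical upper bound for a "brief" event
FOLLOW_WINDOW_S = 30.0  # hypothetical gap within which one event "follows" another


@dataclass
class AudioEvent:
    category: str
    start_s: float
    duration_s: float


def sustained_after_brief(events: List[AudioEvent]) -> List[AudioEvent]:
    """Return sustained events that begin shortly after a brief event ends."""
    ordered = sorted(events, key=lambda e: e.start_s)
    for_review = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.start_s - (prev.start_s + prev.duration_s)
        follows = 0 <= gap <= FOLLOW_WINDOW_S
        if prev.duration_s <= BRIEF_MAX_S and cur.duration_s > BRIEF_MAX_S and follows:
            for_review.append(cur)
    return for_review
```

With these values, a two-second doorbell followed eight seconds later by 45 seconds of speech sends the speech to review, while an isolated bark later in the session does not.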

Suspicious: sustained second voice. A second voice that is present for more than a few seconds and is at a volume comparable to the test-taker's voice is the highest-priority flag the audio layer produces. The reviewer correlates the second-voice timestamps with the test interface activity to determine whether the second voice was coaching (synchronized with the test-taker's responses) or unrelated (the test-taker is in a coworking space and someone is on a phone call nearby). Confirmed coaching leads to a voided score; unrelated speech usually leads to a request for a quieter room on the next attempt.

Suspicious: structured audio cues. Repeating tonal patterns that the agent's audio-cue detector flags are escalated to human review. The base rate is low, and most flags are false positives from smart-home devices or phone notifications. But a structured cue pattern that correlates with answer-selection events on the test interface is the canonical cheating signature for audio coaching, and reviewers know what to look for.

Headphones, headsets, and earbuds

The TOEIC Link Listening module is delivered via the device's audio output, and the test-taker can use either built-in speakers or headphones. Most test-takers use headphones — the audio quality is better, there is less ambient leak, and the listening passages are easier to follow. The proctoring agent does not require any specific device.

What the agent does check is whether the audio is being routed to a hidden output. A Bluetooth headset that is paired but not in the camera frame is suspicious because the test-taker could have a second person speaking through it. The pre-test environment scan asks the test-taker to show all headphones and earbuds in the camera frame and to confirm that no Bluetooth audio device is paired and out of view. The agent reads the operating system's Bluetooth device list and flags any audio device that is connected but not visible.
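The connected-but-not-visible rule reduces to a filter over the device list. The sketch below assumes the Bluetooth list has already been read from the operating system and that the room scan yields a set of device names the test-taker showed on camera; `BluetoothDevice` and both inputs are hypothetical and do not correspond to any real OS API.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class BluetoothDevice:
    name: str
    is_audio: bool
    connected: bool


def hidden_audio_devices(os_devices: List[BluetoothDevice], shown_in_scan: Set[str]) -> List[str]:
    """Names of connected audio devices that were not shown during the room scan."""
    return [
        d.name
        for d in os_devices
        if d.is_audio and d.connected and d.name not in shown_in_scan
    ]
```

A connected keyboard, or a pair of earbuds that is known to the system but not connected, would not be flagged by this filter.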

Wired headphones are simpler. The agent confirms that the audio output is routed to the headphone jack, and the camera scan confirms that the headphones are in frame. There is no mechanism for a hidden coach to deliver audio through a wired headphone unless the wire goes somewhere out of frame, which the room scan is designed to catch.

How to set up a quiet room without overspending

Audio flags rarely come from spending too little on equipment. They come from not anticipating which sounds will be present during the session. A few minutes of preparation eliminates the majority of false-positive flags.

Silence phones and smart-home devices. Phones in Do Not Disturb mode still emit emergency-bypass tones. Set the phone to airplane mode or, better, turn it off. Smart speakers (Alexa, Google Home, HomePod) should be unplugged or muted; they sometimes respond to test audio that contains words near their wake words. Smart displays should be powered off.

Schedule around predictable interruptions. Garbage collection on Tuesday morning, lawn maintenance on Saturday afternoon, and school dismissal in the early afternoon all produce predictable ambient noise. Schedule the test outside these windows when possible.

Notify housemates. Starting a 90-minute test session without warning housemates is the most common source of "sustained second voice" flags: family members talk nearby without realizing the test is in progress. A simple note on the door and a heads-up before starting eliminate most of these flags.

Close the door. A closed door reduces ambient noise from elsewhere in the home by 10-15 dB and is more effective than any consumer-grade noise-canceling headphone. The pre-test environment scan includes closing the door, and the agent records whether it was closed at the start of the session.
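To put the 10-15 dB figure in perspective, decibels convert to ratios with the standard formulas: a power ratio of 10^(dB/10) and an amplitude ratio of 10^(dB/20). The helper names below are only for illustration.

```python
def power_ratio(db: float) -> float:
    """Factor by which sound power drops for a given reduction in dB."""
    return 10 ** (db / 10)


def amplitude_ratio(db: float) -> float:
    """Factor by which sound pressure (amplitude) drops for a given reduction in dB."""
    return 10 ** (db / 20)


for db in (10, 15):
    print(f"{db} dB: power /{power_ratio(db):.1f}, amplitude /{amplitude_ratio(db):.2f}")
```

A 10 dB reduction cuts the sound power reaching the microphone by a factor of 10, and 15 dB cuts it by a factor of roughly 32.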

Avoid coffee shops and coworking spaces. Public locations almost always produce sustained-second-voice flags from nearby conversations and music with lyrics. The proctoring service formally permits these locations but pragmatically discourages them. Choose a private room.

The full set of pre-test preparation steps is in the TOEIC Link test day checklist, which covers the audio layer alongside the camera, network, and clipboard layers.

How audio flags affect your score

An audio flag during the test session goes into the session log along with the timestamp, the audio category, and the duration. Most flags do not void the score on their own — they trigger human review, and the reviewer correlates the audio activity with the test interface activity to decide whether the flag is consistent with cheating.
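A log entry of the kind described above might look like the following. The field names, the `needs_human_review` default, and the JSON encoding are assumptions made for illustration; the actual log format is not documented here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AudioFlag:
    timestamp_s: float   # seconds since the session started (hypothetical unit)
    category: str        # e.g. "second_voice" (hypothetical label)
    duration_s: float
    needs_human_review: bool = True  # most flags go to a reviewer rather than voiding the score


def to_log_line(flag: AudioFlag) -> str:
    """Serialize one flag as a single JSON line with stable key order."""
    return json.dumps(asdict(flag), sort_keys=True)
```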

The relevant context for the reviewer is whether the audio activity correlates with answer-selection behavior. A second voice that says nothing for two minutes and then speaks immediately before each answer is selected is the canonical coaching pattern, and the reviewer voids the score. A second voice that talks continuously without correlation to answer selection is treated as ambient interference, and the score is usually released. The audio layer works alongside the screen-recording layer and the network-traffic layer: corroborating signals across layers strengthen a flag, and absence of corroboration weakens it.
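The correlation a reviewer looks for can be approximated with a simple score: the share of answer selections that come shortly after a second-voice segment ends. This is a sketch under stated assumptions, not the review procedure itself; `LEAD_WINDOW_S` and the function name are hypothetical.

```python
from typing import List, Tuple

LEAD_WINDOW_S = 5.0  # hypothetical: how soon after speech an answer counts as "prompted"


def coaching_correlation(
    voice_segments: List[Tuple[float, float]], answer_times: List[float]
) -> float:
    """Fraction of answers selected within LEAD_WINDOW_S after a second-voice segment ends."""
    if not answer_times:
        return 0.0
    hits = 0
    for t in answer_times:
        if any(0 <= t - end <= LEAD_WINDOW_S for _, end in voice_segments):
            hits += 1
    return hits / len(answer_times)
```

Short bursts of speech that end just before every answer score 1.0, while a single continuous conversation that runs through the whole module scores 0.0 because it never ends right before a selection. A human reviewer would still weigh this alongside the other layers.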

Only a small fraction of flagged sessions end in a voided score, because the false-positive base rate is high: most home environments produce some flaggable audio at some point during a 90-minute session. The majority of flagged sessions have their scores released after review. Test-takers who arrive at a quiet room with phones silenced, housemates notified, and no smart-home devices active will rarely see an audio flag at all.

The microphone is on for the entire session, not just for Speaking. Keeping the room quiet and private for that full window is the test-taker's job, and it is one of the cheaper preparations the proctoring stack asks for.