TOEIC Link Listening — Prosody and Connected Speech Decoding: The Phonological Layer That Separates Band-21 Transcription Listeners From Band-23 Discourse Listeners

The band-21 listening candidate fails on natural-speech audio for a reason that is phonological rather than lexical. The candidate has the vocabulary and grammar to understand each word in citation form, but processes the audio stream as if it were a sequence of citation-form words pronounced in isolation, and natural-speech audio contains no such sequence. Natural speech compresses sequences of words through connected-speech reductions, directs the listener's attention through prosodic stress and intonation contour, and signals discourse structure through phonological cues that arrive before the lexical content. The band-21 candidate is therefore decoding the audio with a model that the audio does not fit, which produces a characteristic failure pattern: the candidate can transcribe slow, clearly articulated audio with high accuracy but fails on the natural-speech items that the band-23 rubric targets.

The band-23 candidate is operating with a different phonological model. The candidate decodes prosodic stress and intonation contour in real time, applies connected-speech reduction rules in reverse to recover the underlying word sequence, and uses the phonological cues to predict discourse structure before the propositional content arrives. The prediction is what makes the band-23 listening accuracy achievable on natural-speech items — the candidate is no longer racing the audio to extract meaning lexically but is using the phonology to compress the meaning extraction into the time the audio actually provides. This guide formalizes the four prosodic features, the three connected-speech reduction families, and the installation drill that produces transfer to test-equivalent listening.

Why the phonological layer is the binding constraint on natural-speech listening

The TOEIC Link listening section assesses listening comprehension across audio that ranges from short conversational exchanges to extended monologues, and the audio is engineered to approximate natural speech tempo, prosody, and connected-speech patterns. The score profile across candidate populations shows that the band-21 ceiling is reached primarily on the natural-speech items, and within those items, the binding constraint is the candidate's ability to recover the discourse-relevant content from a phonological signal that does not present the words in citation form.

The cognitive mechanism is the mismatch between the candidate's stored lexical representations and the surface phonological forms in the audio. The candidate has learned vocabulary in citation form — the dictionary pronunciation of each word as an isolated unit — and the candidate's lexical access process is optimized for matching incoming acoustic signal against citation-form templates. Natural speech does not produce citation-form signal; it produces connected-speech signal that has been compressed, reduced, and prosodically shaped. The lexical-access process either fails to match the signal or matches it after a delay that consumes the time budget available for downstream comprehension, and the resulting accuracy loss produces the band-21 ceiling on natural-speech items.

The phonological decoding discipline addresses the mismatch by installing a reverse-mapping layer that recovers the citation-form lexical sequence from the connected-speech signal in real time. The candidate who has installed the reverse-mapping layer can match incoming natural-speech signal against citation-form templates without the delay, and the time saved by the reverse-mapping flows into downstream comprehension, which is what the rubric is scoring at the band-23 level.

The four prosodic features

Prosodic features in TOEIC Link listening audio fall into four categories that have to be decoded separately because they perform different discourse functions and the comprehension questions target each feature with characteristic question patterns.

Feature 1 — Lexical stress

Lexical stress is the relative prominence assigned to syllables within multi-syllable words, and the lexical-stress pattern is a primary cue to word identity in natural speech. Many English noun-verb pairs are distinguished primarily by lexical stress: REcord (noun) versus reCORD (verb), PREsent (noun) versus preSENT (verb). The candidate who does not decode lexical stress will misidentify the word category and propagate the error downstream into the syntactic parse.
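The stress-to-category mapping for noun-verb pairs can be sketched as a small lookup. This is only an illustration: the table name, the function name, and the 0-based syllable indexing are assumptions made for this example, and the table holds just the two pairs mentioned above.

```python
# Hypothetical lookup table: for each word, map the index of the stressed
# syllable (0-based, an assumption of this sketch) to the word category.
STRESS_TO_CATEGORY = {
    "record": {0: "noun", 1: "verb"},   # REcord vs reCORD
    "present": {0: "noun", 1: "verb"},  # PREsent vs preSENT
}


def category_from_stress(word: str, stressed_syllable: int) -> str:
    """Return the word category implied by the stressed syllable.

    Returns "unknown" when the word or the stress position is not in
    the illustrative table.
    """
    patterns = STRESS_TO_CATEGORY.get(word.lower())
    if patterns is None:
        return "unknown"
    return patterns.get(stressed_syllable, "unknown")
```

In a week-one drill, the candidate would mark the stressed syllable while listening and then check whether the implied category agrees with the sentence context.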

Feature 2 — Sentence stress

Sentence stress is the relative prominence assigned to words within sentences, and the sentence-stress pattern signals the information structure of the utterance — which content is new, which is given, and which is contrastive. The comprehension questions that target sentence stress are typically asking about the speaker's emphasis, contrast, or focus, and the candidate who does not decode sentence stress will treat all content words as equally informative and miss the rubric-scored emphasis distinction. For the related question-stem mapping discipline, see the reading question stem keyword mapping guide.

Feature 3 — Intonation contour

Intonation contour is the pitch trajectory across an utterance, and the contour signals the utterance's discourse function — statement, question, confirmation-request, list-continuation, finality. The comprehension questions that target intonation are typically asking about the speaker's attitude, the question vs statement distinction, or the boundary between list items, and the candidate who does not decode intonation contour will lose the discourse-function distinction and answer the question against the lexical content alone.

Feature 4 — Tempo and pause structure

Tempo and pause structure is the temporal organization of the utterance, and the pause placement signals syntactic boundaries, discourse-segment boundaries, and turn-taking points in dialogues. The comprehension questions that target tempo and pause are typically asking about the discourse organization, the relationship between successive utterances, or the conversation's turn structure, and the candidate who does not decode the temporal layer will mis-segment the audio into syntactic chunks that the rubric does not recognize.

The three connected-speech reduction families

Connected-speech reductions in natural English fall into three families that the candidate has to recognize in reverse to recover the citation-form lexical sequence.

The first family is linking: adjacent words connect across the word boundary, with consonant-vowel linking producing a continuous syllable across the boundary and vowel-vowel linking inserting a glide that the candidate has to discount when recovering the underlying word sequence.

The second family is elision: unstressed syllables, function-word vowels, and adjacent-consonant sequences collapse or delete, which produces surface forms that are shorter than the citation-form templates and that the candidate has to expand in reverse to recover the underlying words.

The third family is assimilation: adjacent sounds across the word boundary become more similar, with place-of-articulation assimilation being the most common pattern and producing surface forms whose articulation differs systematically from the citation-form templates.

For the speaking-side counterpart of these reductions, see the speaking strategic pausing and cognitive load distribution guide.
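The reverse mapping from surface forms to citation forms can be sketched as a table lookup over adjacent tokens. This is a minimal illustration, not a model of real speech perception: the table, the function name, and the respelled surface forms ("tur noff", "nex day", "tem bikes") are assumptions chosen to give one common textbook example per family.

```python
# Hypothetical table: respelled surface pair -> (citation form, family).
# The respellings are rough orthographic stand-ins for what is heard.
REDUCTIONS = {
    ("tur", "noff"): ("turn off", "linking"),        # consonant-vowel linking
    ("nex", "day"): ("next day", "elision"),         # /t/ dropped in a cluster
    ("tem", "bikes"): ("ten bikes", "assimilation"), # /n/ -> /m/ before /b/
}


def recover_citation_forms(surface: str):
    """Scan a respelled surface string and undo any known reductions.

    Returns the recovered citation-form string and the list of reduction
    families encountered, in order.
    """
    tokens = surface.lower().split()
    recovered, events = [], []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in REDUCTIONS:
            citation, family = REDUCTIONS[pair]
            recovered.append(citation)
            events.append(family)
            i += 2  # both surface tokens are consumed by the match
        else:
            recovered.append(tokens[i])
            i += 1
    return " ".join(recovered), events
```

For example, `recover_citation_forms("the nex day we bought tem bikes")` returns the string "the next day we bought ten bikes" together with the event list `["elision", "assimilation"]`, which mirrors the annotation step of the week-three drill.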

The four-week installation drill

Phonological decoding is a discipline that has to be installed through deliberate practice because the reverse-mapping layer is initially absent and the candidate's lexical-access process is initially calibrated against citation-form templates only. The four-week drill below produces transfer to test-equivalent listening.

Week one focuses on lexical stress and sentence stress decoding because these are the prosodic features with the most direct lexical-access consequences. The candidate practices minimal-pair stress discrimination on noun-verb pairs and emphasis-marked sentences, with explicit marking of the stressed syllables and words during slow-tempo listening. The exit criterion is the candidate's ability to decode lexical and sentence stress at natural tempo across five consecutive items without conscious search.

Week two adds intonation contour decoding. The candidate practices on dialogue audio where intonation distinguishes question vs statement, confirmation vs assertion, and list-continuation vs finality. The exit criterion is the candidate's ability to predict the speaker's discourse function from intonation contour alone, before the lexical content arrives, across three consecutive dialogue items.

Week three adds the connected-speech reduction reverse-mapping. The candidate practices on natural-speech audio with explicit annotation of linking, elision, and assimilation events, and reconstructs the citation-form lexical sequence from the surface signal. The exit criterion is the candidate's ability to recover the underlying word sequence from natural-speech signal at test-equivalent tempo across five consecutive items.

Week four moves the full phonological decoding stack into timed practice. The candidate runs the complete decoding discipline under test-equivalent timing and measures the comprehension-question accuracy on the natural-speech items. The exit criterion is a measurable improvement in natural-speech accuracy across three consecutive practice sessions, with particular attention to the multi-speaker discrimination items where the phonological layer compounds with speaker-tracking load. For the speaker-tracking layer that interacts with phonological decoding, see the listening multi-speaker discrimination and tracking guide.
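The exit criteria in the drill can be tracked with a simple practice log. The sketch below is one possible reading, with assumptions stated plainly: "consecutive items" is taken to mean the current unbroken run of correct items at the end of the log, and "measurable improvement across three consecutive sessions" is taken to mean strictly rising accuracy over the most recent three sessions. All function names are hypothetical.

```python
from typing import Sequence


def trailing_streak(results: Sequence[bool]) -> int:
    """Count the unbroken run of correct items at the end of the log."""
    streak = 0
    for ok in reversed(results):
        if not ok:
            break
        streak += 1
    return streak


def meets_streak_criterion(results: Sequence[bool], required: int) -> bool:
    """True when the most recent `required` items were all correct."""
    return trailing_streak(results) >= required


def improved_across_sessions(accuracies: Sequence[float], sessions: int = 3) -> bool:
    """True when accuracy rose strictly across the last `sessions` sessions.

    Assumption of this sketch: improvement means each session scores
    higher than the one before it.
    """
    recent = accuracies[-sessions:]
    if len(recent) < sessions:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Under these assumptions, weeks one and three would use `meets_streak_criterion(log, 5)`, week two would use `meets_streak_criterion(log, 3)`, and week four would use `improved_across_sessions` on the natural-speech accuracy per session.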

How phonological decoding interacts with the rest of the listening module

Phonological decoding is the perceptual sub-skill that complements the lexical and syntactic sub-skills the candidate brings to the test. The lexical and syntactic sub-skills determine the candidate's accuracy on slow, clearly articulated audio; the phonological sub-skill determines the candidate's accuracy on the natural-speech audio that the rubric scores at the band-23 level. The two layers are largely independent, and the band-21-to-23 transition almost always requires the phonological layer to be installed because the lexical and syntactic layers have typically already reached their ceilings at the band-21 level.

The candidate who installs phonological decoding without the surrounding listening sub-skills will see accuracy shift on the natural-speech items while the slow-tempo scores remain stable. The candidate who installs phonological decoding alongside the multi-speaker discrimination discipline and the question-stem preview discipline sees the phonological layer integrate fully with the listening-comprehension scoring frame, and the natural-speech items move from being the binding constraint on the listening score to being a strength that lifts the overall listening score above the citation-form ceiling. The phonological-decoding layer is the highest-leverage installation target in the listening module at the band-21-to-23 boundary.