TOEIC Link Speaking — Pronunciation Self-Assessment: The Six Phonetic Categories That Decide Speaking Sub-Scores

The TOEIC Link Speaking section is scored against multiple sub-criteria that include content, grammar, vocabulary, and pronunciation. For candidates who plateau at the mid-band Speaking scores, pronunciation is the most frequent constraint — not because the candidate's overall pronunciation is poor, but because one or two specific phonetic categories drag the sub-score down even when the other categories are adequate. The remediation is not generalized pronunciation practice. The remediation is a disciplined self-assessment that decomposes pronunciation into six phonetic categories, identifies which one or two are the constraint for the individual candidate, and produces a targeted practice plan that addresses only the constraining categories.

This guide presents the six-category framework, describes the diagnostic procedure for each category, and provides per-category practice priorities. For related Speaking topics, see the guides on katakana-to-English pronunciation fix, sentence stress and rhythm for listening, and shadowing method for listening.

Why generalized pronunciation practice underperforms targeted practice

Three properties of pronunciation development make generalized practice an inefficient use of preparation time.

Property 1 — pronunciation problems are individual. Each candidate brings a distinct set of pronunciation challenges based on the candidate's first-language phonology, the candidate's prior English exposure, and the candidate's articulatory habits. A Japanese first-language candidate has different pronunciation challenges from a Chinese first-language candidate or a Korean first-language candidate. Within each first-language group, individual candidates also differ — one candidate struggles with /l/-/r/ contrast, another with vowel reduction in unstressed syllables, another with sentence-final intonation. Generalized practice treats all candidates as if they have the same pronunciation profile, which dilutes practice time across categories that may not be the candidate's constraint.

Property 2 — sub-score gains are non-linear in category-specific effort. A candidate who improves the constraining category by one band typically gains one full band on the pronunciation sub-score. A candidate who improves a non-constraining category by one band typically gains nothing on the sub-score, because the constraining category remains the binding limit. The implication is that practice time on the constraining category is far more sub-score-productive than practice time on a non-constraining category (roughly five to ten times, as a rule of thumb). Identifying the constraint is therefore the highest-leverage activity in pronunciation preparation.

Property 3 — self-assessment is feasible without an instructor. Each of the six phonetic categories has objective diagnostic signs that the candidate can observe in a self-recorded speech sample. The diagnostic procedure does not require a phonetics expert or a teacher; it requires a recording device, a single session of roughly 45 minutes, and a checklist. The accessibility of self-assessment removes the most common barrier to targeted pronunciation practice: the perception that an instructor is required to identify the constraint.

The six phonetic categories

Category 1 — Segmental accuracy

Segmental accuracy is the correct production of individual consonant and vowel sounds, including contrasts such as /l/-/r/, /v/-/b/, /s/-/ʃ/, /θ/-/s/, and /æ/-/ʌ/. Segmental errors are the most visible pronunciation problems and are often the first category that candidates work on. They are also the easiest category to diagnose because the errors are localized to specific sounds and can be identified by comparing the candidate's production to a native-speaker model.

Diagnostic procedure. Record yourself reading aloud a passage that contains the target contrasts (a 200-word passage typically suffices). Listen to the recording and mark each instance where the produced sound differs from the target sound. Tabulate the errors by contrast, and for each contrast divide the number of errors by the number of times the target sound occurs in the passage. Segmental accuracy is a constraint if this error rate exceeds roughly 20% for one or more contrasts.
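The tabulation step can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the contrast labels and the sample tally below are hypothetical, and the 20% cut-off is the rough threshold stated above, not an official scoring rule.

```python
from collections import Counter

THRESHOLD = 0.20  # the guide's rough 20% cut-off


def contrast_error_rates(attempts):
    """attempts: iterable of (contrast, is_correct) pairs, one per target sound."""
    totals, errors = Counter(), Counter()
    for contrast, is_correct in attempts:
        totals[contrast] += 1
        if not is_correct:
            errors[contrast] += 1
    return {c: errors[c] / totals[c] for c in totals}


def flagged_contrasts(attempts, threshold=THRESHOLD):
    """Contrasts whose error rate exceeds the threshold, worst first."""
    rates = contrast_error_rates(attempts)
    flagged = [c for c, rate in rates.items() if rate > threshold]
    return sorted(flagged, key=lambda c: rates[c], reverse=True)


# Hypothetical tally from one marked-up recording.
sample = (
    [("l-r", False)] * 6 + [("l-r", True)] * 14      # 6 of 20 wrong (30%)
    + [("v-b", False)] * 2 + [("v-b", True)] * 18    # 2 of 20 wrong (10%)
    + [("th-s", False)] * 5 + [("th-s", True)] * 5   # 5 of 10 wrong (50%)
)

print(flagged_contrasts(sample))  # ['th-s', 'l-r']
```

In this made-up tally, /θ/-/s/ and /l/-/r/ would be the contrasts to drill first, while /v/-/b/ stays below the threshold.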

Practice priority. Segmental practice is most efficient when the candidate focuses on the two or three highest-frequency error contrasts rather than attempting to address all segmental errors simultaneously. Minimal-pair drills, in which the candidate alternates between contrasting words (light–right, very–berry, sink–think), are the standard practice format.

Category 2 — Sentence stress

Sentence stress is the correct placement of primary and secondary stress within multi-syllable words and within multi-word phrases. English is a stress-timed language in which content words carry stress and function words are typically unstressed. Candidates whose first language is syllable-timed or mora-timed (French and Korean are usually described as syllable-timed, Japanese as mora-timed) often produce English with insufficient stress contrast, which reduces intelligibility and is heard by raters as "robotic" or "syllable-by-syllable."

Diagnostic procedure. Record yourself reading aloud a sentence-stress diagnostic such as "The product launch is scheduled for the third quarter." Listen to the recording. Sentence stress is a constraint if the function words ("the," "is," "for," "the") carry stress equal to or greater than the content words ("product," "launch," "scheduled," "third quarter").

Practice priority. Sentence-stress practice is most efficient when the candidate marks the stress pattern explicitly on the script before recording, produces the marked version, compares to a native-speaker model, and iterates. The deliberate marking step is what produces transfer to spontaneous speech; reading without marking typically does not transfer.
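For candidates who keep their scripts in a text file, the marking step can be roughed out automatically and then corrected by hand. The sketch below upper-cases content words and leaves function words in lower case. The function-word list is a small assumed sample, not a complete inventory, and the script marks only which words carry stress, not which syllable inside a word.

```python
import re

# Assumed, deliberately small function-word list; extend it as needed.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "was", "were", "be", "for", "of",
    "to", "in", "on", "at", "and", "or", "but", "it", "this", "that",
}


def mark_stress(sentence):
    """Upper-case content words; keep function words in lower case."""
    def mark(match):
        word = match.group(0)
        if word.lower() in FUNCTION_WORDS:
            return word.lower()
        return word.upper()

    return re.sub(r"[A-Za-z']+", mark, sentence)


print(mark_stress("The product launch is scheduled for the third quarter."))
# the PRODUCT LAUNCH is SCHEDULED for the THIRD QUARTER.
```

The output is a starting point for the marked script: read it aloud with the capitalized words stressed, then compare against the native-speaker model.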

Category 3 — Prosody (intonation contours)

Prosody is the pitch contour over a phrase or sentence. English uses falling intonation at the end of statements and wh-questions, rising intonation at the end of yes/no questions, and a variety of nuanced contours for lists, contrasts, and continuations. Candidates whose first-language prosody differs from English prosody often produce flat or inappropriate contours that are heard by raters as "monotone" or "unnatural."

Diagnostic procedure. Record yourself reading aloud a prosody diagnostic that includes statements, yes/no questions, wh-questions, and a list. Listen to the recording. Prosody is a constraint if the intonation contours are flat across sentence types or if the contours are inappropriate for the sentence type (rising at the end of a statement, falling at the end of a yes/no question).

Practice priority. Prosody practice is most efficient when the candidate uses an audio model — a native-speaker recording of the diagnostic passage — and shadows the model with attention to the pitch contour rather than to the segmental content. Shadowing for prosody is a different activity from shadowing for vocabulary; the focus is on the pitch pattern rather than on the words.

Category 4 — Connected speech (linking, reduction, contraction)

Connected speech is the set of phonological processes that occur at word boundaries in natural English — linking (the consonant at the end of one word attaches to the vowel at the start of the next word, as in "an apple" → "anapple"), reduction (unstressed vowels reduce to schwa, as in "for" → /fər/), and contraction (auxiliary verbs combine with subjects, as in "I am" → "I'm"). Candidates who produce English in fully separated, fully pronounced words are heard by raters as "careful" or "textbook" rather than as fluent.

Diagnostic procedure. Record yourself reading aloud a connected-speech diagnostic that includes high-frequency phrases ("a lot of," "going to," "want to," "kind of"). Listen to the recording. Connected speech is a constraint if the phrases are produced with full vowels and clear word boundaries rather than with reduction and linking.

Practice priority. Connected-speech practice is most efficient when the candidate practices the highest-frequency reduced phrases as fixed chunks — "gonna," "wanna," "lotta," "kinda" — rather than attempting to apply reduction rules in real time. Chunk-based practice transfers to spontaneous speech more reliably than rule-based practice.

Category 5 — Intelligibility

Intelligibility is the listener's ability to understand the candidate's speech without effort. Intelligibility is influenced by all of the previous four categories but is also influenced by articulatory clarity (whether sounds are produced with sufficient energy and precision), by volume (whether the speech is loud enough to be heard clearly), and by the absence of muttering or trailing off.

Diagnostic procedure. Record yourself producing a one-minute spontaneous response to a TOEIC Link Speaking prompt. Listen to the recording at normal playback volume. Intelligibility is a constraint if you have difficulty understanding your own recording, if any words are inaudible, or if the recording becomes quieter toward the end of phrases.

Practice priority. Intelligibility practice is most efficient when the candidate practices producing the speech sample at a deliberate volume and with deliberate articulatory energy, then gradually reduces the deliberate effort until the natural production is at adequate volume and clarity. The deliberate-then-reduce sequence is more effective than attempting to produce natural speech with adequate clarity directly.

Category 6 — Rate

Rate is the speed of speech in syllables per second or words per minute. The target rate for TOEIC Link Speaking is roughly 130 to 160 words per minute — a moderate pace that is faster than slow careful speech but slower than fast conversational speech. Candidates who speak too slowly (below 100 wpm) sound hesitant; candidates who speak too quickly (above 180 wpm) sacrifice intelligibility and prosody.

Diagnostic procedure. Record yourself producing a one-minute spontaneous response. Count the words. Rate is a constraint if the count is below 100 or above 180 over a one-minute sample.
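If the response runs slightly over or under one minute, the count can be normalized to words per minute before applying the limits. A minimal sketch, assuming you have typed out a transcript of the recording and timed it yourself; the 100 and 180 wpm limits are the ones given above, and the 92-word transcript is a placeholder.

```python
def words_per_minute(transcript, duration_seconds):
    """Whitespace-separated word count, scaled to a one-minute rate."""
    word_count = len(transcript.split())
    return word_count * 60.0 / duration_seconds


def rate_status(wpm, low=100, high=180):
    """Apply the guide's limits: below 100 or above 180 wpm is a constraint."""
    if wpm < low:
        return "constraint: too slow"
    if wpm > high:
        return "constraint: too fast"
    return "adequate"


# Placeholder transcript: 92 words spoken in 60 seconds.
transcript = " ".join(["word"] * 92)
wpm = words_per_minute(transcript, 60)
print(f"{wpm:.0f} wpm: {rate_status(wpm)}")  # 92 wpm: constraint: too slow
```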

Practice priority. Rate practice is most efficient when the candidate uses a metronome or a pacing audio at the target rate (140 wpm is a useful midpoint) and produces speech that matches the pacing reference. The pacing reference is more reliable than self-judgment because candidates are poor at estimating their own speech rate.
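When no pacing audio is at hand, a stopwatch target serves a similar purpose: work out how long a script of known length should take at the target rate and read against the clock. The sketch below is an assumed alternative to a metronome, not a replacement for it. It uses the 140 wpm midpoint and the 130 to 160 wpm target band mentioned above.

```python
def target_duration_seconds(word_count, wpm=140):
    """Seconds needed to read word_count words at the given rate."""
    return word_count * 60.0 / wpm


def duration_window(word_count, slow_wpm=130, fast_wpm=160):
    """(shortest, longest) reading time in seconds that stays in the target band."""
    shortest = target_duration_seconds(word_count, fast_wpm)
    longest = target_duration_seconds(word_count, slow_wpm)
    return shortest, longest


# A 200-word passage at the 140 wpm midpoint.
print(round(target_duration_seconds(200), 1))  # 85.7

shortest, longest = duration_window(200)
print(round(shortest, 1), round(longest, 1))   # 75.0 92.3
```

A 200-word passage read in about 86 seconds is on pace; finishing in under 75 seconds or over 92 seconds falls outside the target band.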

How to run the full self-assessment

A complete self-assessment takes roughly 45 minutes and requires only a recording device.

Step 1 — record the diagnostic battery (15 minutes). Read aloud the four diagnostic passages (segmental, stress, prosody, connected speech) and produce one spontaneous response, which serves both the intelligibility and the rate diagnostics. Save the recordings.

Step 2 — score each category (20 minutes). Apply the diagnostic procedure for each of the six categories. Score each category as "adequate" or "constraint."

Step 3 — identify the top two constraints (5 minutes). If only one or two categories are flagged as constraints, those are the practice priorities. If three or more categories are flagged, prioritize the two with the most severe diagnostic signs.

Step 4 — produce a focused practice plan (5 minutes). Allocate 80% of the available practice time to the top two constraining categories and 20% to maintenance of the adequate categories.
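Steps 3 and 4 amount to a small ranking and allocation exercise, sketched below. The numeric severity scale (0 for adequate, higher for more severe) is an assumption added for illustration, since the guide only asks for an "adequate" or "constraint" label plus a judgment of severity. The sketch also assumes that flagged categories outside the top two share the maintenance time and that the focus time is split evenly.

```python
def practice_plan(severity, total_minutes, focus_percent=80, max_focus=2):
    """severity: category -> 0 if adequate, higher numbers = more severe (assumed scale)."""
    flagged = [c for c, s in severity.items() if s > 0]
    flagged.sort(key=lambda c: severity[c], reverse=True)
    focus = flagged[:max_focus]
    if not focus:
        # Nothing flagged: all time goes to maintenance.
        return {"focus": {}, "maintenance_minutes": float(total_minutes)}
    focus_minutes = total_minutes * focus_percent / 100
    per_category = focus_minutes / len(focus)
    return {
        "focus": {c: per_category for c in focus},
        "maintenance_minutes": total_minutes - focus_minutes,
    }


# Hypothetical results of one self-assessment, with 300 minutes available per week.
severity = {
    "segmental": 2, "sentence stress": 0, "prosody": 3,
    "connected speech": 1, "intelligibility": 0, "rate": 0,
}
plan = practice_plan(severity, total_minutes=300)
print(plan["focus"])                # {'prosody': 120.0, 'segmental': 120.0}
print(plan["maintenance_minutes"])  # 60.0
```

In this example three categories are flagged, so only the two most severe (prosody and segmental accuracy) receive focused time, and connected speech waits for the next assessment cycle.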

A candidate who runs the self-assessment once, identifies the top two constraints, and practices those constraints for two weeks typically gains one full band on the pronunciation sub-score. The gain is concentrated in the constraint categories, with no detectable change in the non-constraint categories.

For complementary Speaking guides, see Katakana to English Pronunciation Fix, Sentence Stress and Rhythm for Listening, Shadowing Method for Listening, and Speaking and Writing Tips.