TOEIC Link Speaking — Picture Description Structure: How Overview, Detail Selection, and Inference Sequencing Determine Your Speaking Part 3 Score

TOEIC Link Speaking Part 3 asks the candidate to describe a photograph in thirty to forty-five seconds, and the rubric rewards a specific descriptive sequence that many candidates do not produce instinctively. This guide maps the four-move picture description structure, the three detail-selection heuristics that produce level-3 and level-4 responses, and the four observation traps that depress otherwise fluent responses.

EnglishBlitz Editorial Team

TOEIC Link Speaking Part 3 presents the candidate with a photograph — typically a workplace scene, a public-space scene, or a commercial-activity scene with multiple people and objects — and asks the candidate to describe it within thirty to forty-five seconds of response time. Candidates frequently approach the task as a vocabulary-and-grammar exercise and concentrate their preparation on lexical range (specific nouns for objects, accurate prepositions of place, present-progressive tense for ongoing actions). The lexical-and-grammatical dimensions matter, but the scoring rubric weights descriptive structure equally, and a vocabulary-rich response that omits the overview move or fails to sequence details coherently will be capped at level 2 (out of 4) regardless of its surface accuracy.

This guide describes the four-move picture description structure graders expect in Part 3, the three detail-selection heuristics that reliably produce level-3 and level-4 responses, and the four observation traps that depress otherwise fluent responses. The material applies primarily to Speaking Part 3 (describe a picture); the more open-ended response tasks in Parts 4 through 6 are addressed separately. For related speaking topics, see the guides on speaking fluency and hesitation recovery and on speaking pronunciation self-assessment.

Why descriptive structure matters as much as vocabulary

A Speaking Part 3 grader listens to thirty to forty-five seconds of speech and forms an assessment along multiple rubric dimensions: pronunciation, fluency, grammar, vocabulary, and descriptive coherence. Descriptive coherence is the dimension most candidates underweight. It asks whether a listener who cannot see the picture could reconstruct it from the response; that is, whether the response establishes the scene first, then enumerates details in a logical sequence, and then concludes with an appropriate framing.

A response that lists details without establishing the scene first fails the coherence dimension: the listener cannot connect the details to a unifying image, and the grader hears the response as fragmented. A response that establishes the scene but jumps randomly between unrelated details (a chair in the foreground, then the ceiling, then a person's clothing, then the floor) also fails. A response that produces a coherent scene-and-details narrative but omits an inference move (a brief comment on what the people might be doing or what the situation appears to be) reads as observation-only and plateaus at level 3.

Three implications follow.

Implication 1 — the overview comes first, always. A response that opens by naming the most prominent object or person without first establishing the scene type ("this is a photograph of an office," "this is a picture of a busy street," "this is an image of a restaurant") fails the structural-coherence test. The overview is a one-sentence commitment that anchors the rest of the response.

Implication 2 — detail sequencing must be deliberate. A response that selects details in an undisciplined order signals weaker visual-analytic skills. The high-band sequencing patterns (foreground-to-background, left-to-right, center-then-periphery, person-then-object) are not stylistic preferences; they are scoring signals.

Implication 3 — an inference move distinguishes level 4 from level 3. A response that observes the picture without inferring what is happening, what the people might be discussing, or what the setting suggests will plateau at level 3. A brief inference move at the end of the response is the rate-limiter for level-4 performance.

The four obligatory moves

Every Speaking Part 3 response should execute four moves in approximately this order. The moves are derived from the descriptive-genre conventions taught in academic and professional contexts and confirmed by analysis of high-scoring TOEIC Link sample responses.

Move 1: Scene overview

The opening move names the scene type and, where appropriate, the location and the time-of-day context. The move anchors the listener's mental image and signals to the grader that the candidate is approaching the task with a structured framework rather than reacting to whatever object catches the eye first.

Standard formulations include:

  • "This is a photograph of an office, and several people are working at their desks."
  • "This appears to be a busy street in a downtown commercial area during the afternoon."
  • "The image shows the interior of a restaurant, and a server is taking an order from two customers."
  • "This is a picture of a hotel lobby, and a guest is checking in at the front desk."

The overview should be one sentence: long enough to establish the scene type and the most prominent activity, short enough to leave most of the response time for the remaining three moves.

Move 2: Foreground or primary subject

The second move identifies the foreground or the primary subject of the picture and describes its salient features. In a workplace photograph, the foreground is typically a person or a group of people performing an action. In a public-space photograph, the foreground is often a single person, a pair of people in interaction, or a prominent object.

Standard formulations include:

  • "In the foreground, a man in a blue suit is typing on a laptop computer."
  • "The person closest to the camera is a woman wearing a red jacket, and she is holding a clipboard."
  • "Two people are sitting at a table in the front of the frame, and they appear to be having a meeting."

Two anchoring vocabulary devices help. First, color-and-clothing anchors ("a man in a blue suit," "a woman wearing a red jacket") are scoring signals: they demonstrate vocabulary range and let the listener place the figure in a mental image of the scene. Second, present-progressive verbs ("is typing," "is holding," "is sitting") are the conventional tense for picture description and signal control of the genre.

Move 3: Background or secondary subjects

The third move shifts to the background or the secondary subjects and provides additional context. The shift should be marked with a transition phrase that cues the listener to the change in focus.

Standard formulations for the transition include:

  • "In the background, ..."
  • "Behind them, ..."
  • "Further into the scene, ..."
  • "On the right side of the picture, ..."

Standard formulations for the secondary subjects include:

  • "In the background, other employees are working at computer workstations."
  • "Behind the main figure, several cars are parked along the street."
  • "Further into the scene, a row of tables is occupied by other customers."

The background description should not be exhaustive. A response that attempts to inventory every visible object in the background reads as a list rather than a description and is penalized. Two to three background details, selected for their relevance to the scene, are sufficient.

Move 4: Inference or framing

The fourth move provides an inference about what is happening, what the situation suggests, or what the people might be doing. The move converts the response from pure observation into interpretive description and is what separates a level-4 response from a level-3 one.

Standard formulations include:

  • "Overall, the scene suggests a typical workday in a corporate office."
  • "It looks like the meeting has just started, because the participants still have their notebooks closed."
  • "The mood of the picture is relaxed, and the customers appear to be enjoying their conversation."
  • "Based on the equipment, this seems to be a hotel that caters to business travelers."

The inference move can be brief — one sentence is sufficient. The move should be grounded in a visible detail ("because the participants still have their notebooks closed") rather than asserted without evidence ("it must be a Monday morning"). Graders penalize unfounded inferences.
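
The four moves can be treated as slots that are filled in a fixed order. The sketch below is only a practice aid and is not part of any TOEIC Link material: the `FourMoveResponse` class and its field names are hypothetical, and the sample sentences are taken from the formulations above. It shows one way to check a planned response for a missing move before speaking it.

```python
from dataclasses import dataclass


@dataclass
class FourMoveResponse:
    """Hypothetical container for one planned Part 3 response."""
    overview: str    # Move 1: scene type and most prominent activity
    foreground: str  # Move 2: primary subject
    background: str  # Move 3: secondary subjects, opened by a transition phrase
    inference: str   # Move 4: interpretation grounded in a visible detail

    def missing_moves(self) -> list[str]:
        """Return the names of any empty moves, in speaking order."""
        order = ["overview", "foreground", "background", "inference"]
        return [name for name in order if not getattr(self, name).strip()]

    def script(self) -> str:
        """Join the non-empty moves in the fixed four-move order."""
        parts = [self.overview, self.foreground, self.background, self.inference]
        return " ".join(p.strip() for p in parts if p.strip())


plan = FourMoveResponse(
    overview="This is a photograph of an office, and several people are working at their desks.",
    foreground="In the foreground, a man in a blue suit is typing on a laptop computer.",
    background="In the background, other employees are working at computer workstations.",
    inference="",  # left empty on purpose to show the check
)
print(plan.missing_moves())  # ['inference']
```

A plan that reports a missing inference move is the observation-only response described earlier, so the check is a quick way to catch it during practice.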

The three detail-selection heuristics

Within the four-move structure, three detail-selection heuristics reliably produce level-3 and level-4 responses. The heuristics differ in how Move 2 (foreground) and Move 3 (background) are populated with details.

Heuristic A — Foreground-to-background

In Heuristic A, the response describes the most prominent foreground element first, then the secondary foreground elements, then the background. The heuristic is the safest default and works for almost any TOEIC Link Part 3 picture.

Example sequence:

Move 1 (overview): This is a photograph of a coffee shop during the morning.

Move 2 (foreground): In the foreground, a barista is preparing an espresso behind the counter. A customer at the counter is holding a paper cup and waiting for the drink.

Move 3 (background): In the background, other customers are sitting at small tables, working on laptops or reading newspapers.

Move 4 (inference): The atmosphere seems busy but calm, suggesting a coffee shop that caters to remote workers and morning commuters.

Heuristic A reads as organized and is the strongest pattern for pictures with a clear foreground-background depth.

Heuristic B — Left-to-right

In Heuristic B, the response sweeps across the picture from left to right, describing each region in turn. The heuristic works well when the picture lacks a clear foreground-background depth — for example, a panoramic landscape, a long row of objects, or a group of people in the same plane.

Example sequence:

Move 1 (overview): This is a photograph of a long counter at an airport check-in area.

Move 2 (left side): On the left side of the picture, a family with several suitcases is approaching the counter.

Move 3 (center and right): In the center, an airline employee is checking documents, and on the right side, two passengers are reviewing their boarding passes.

Move 4 (inference): The scene appears to be a busy morning at the airport, possibly during a peak travel period.

Heuristic B reads as systematic and is the strongest pattern for laterally oriented pictures.

Heuristic C — Person-then-object

In Heuristic C, the response describes the people first (all of them, in priority order), then the objects and the setting. The heuristic works well when the picture is people-centric (a meeting, a conversation, a service interaction) and the objects are secondary.

Example sequence:

Move 1 (overview): This is a picture of a project meeting in a conference room.

Move 2 (people): Four people are seated around a table. The person at the head of the table is presenting, and the other three are taking notes and asking questions.

Move 3 (objects and setting): On the table, there are laptops, papers, and coffee cups. A whiteboard at the back of the room shows a diagram drawn with colored markers.

Move 4 (inference): The meeting looks like a strategy session, and the presenter appears to be leading a planning discussion.

Heuristic C reads as people-centered and is the strongest pattern for meeting, customer-service, and group-activity pictures.
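
The choice among the three heuristics can be reduced to two quick observations about the picture. The sketch below is one possible decision rule, not a rule stated by the rubric: the function name `choose_heuristic` and its two yes/no parameters are hypothetical, and the priority order (people-centric first, then depth) is an assumption made for illustration.

```python
def choose_heuristic(clear_depth: bool, people_centric: bool) -> str:
    """Pick a detail-selection heuristic from two yes/no observations.

    Assumed priority for this sketch: a people-centric picture takes
    Heuristic C; otherwise a picture without clear foreground-background
    depth takes Heuristic B; everything else falls back to Heuristic A,
    which the guide describes as the safest default.
    """
    if people_centric:
        return "C"  # person-then-object
    if not clear_depth:
        return "B"  # left-to-right
    return "A"      # foreground-to-background


# A coffee-shop scene with clear depth and no dominant group of people
print(choose_heuristic(clear_depth=True, people_centric=False))   # A
# A long check-in counter with everyone in the same plane
print(choose_heuristic(clear_depth=False, people_centric=False))  # B
# A project meeting around a table
print(choose_heuristic(clear_depth=True, people_centric=True))    # C
```

A candidate who prefers to default to Heuristic A for every picture with clear depth can swap the first two checks; the point of the sketch is only that the decision should take a second or two.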

The four observation traps

Even with the four moves correctly executed, four observation traps depress otherwise fluent responses to level 2.

Trap 1: Inventory listing

The response lists every visible object without selection or sequencing. A common variant is the candidate naming twelve to fifteen objects in succession ("there is a desk, and a chair, and a lamp, and a computer, and a keyboard, and a mouse, and a phone, and a notebook...") without distinguishing foreground from background or primary from secondary. Inventory listing signals weaker visual-analytic discipline and is penalized.

The remediation is to select a small number of details for Moves 2 and 3 (two to three background details are sufficient) and sequence them by relevance and by Heuristic A, B, or C.

Trap 2: Fabricated specificity

The response invents details that the picture does not show. A common variant is naming brand names ("a Toyota car," "a Sony television") that the picture does not actually display, or inferring identities ("the manager," "the CEO," "the doctor") without supporting evidence. Fabricated specificity signals that the candidate is guessing rather than describing and is penalized.

The remediation is to use generic-but-accurate terms ("a sedan," "a flat-screen television," "a man in a suit," "a person in a white coat") that match the visible evidence without overclaiming.

Trap 3: Tense drift

The response opens in present progressive but, by the middle, drifts into past tense or uses simple present for actions that are still in progress, or the reverse. Tense drift signals weaker control of the descriptive genre and is penalized.

The remediation is to stay in present progressive throughout the response for ongoing actions ("is typing," "is walking," "is talking") and in simple present for static states ("there is," "has," "appears to be"). Clothing conventionally takes the progressive ("is wearing"). Past tense is rarely appropriate for picture description and should be avoided unless the picture clearly depicts a completed action.

Trap 4: Time imbalance

The response is too short to demonstrate descriptive range or too long to fit the time budget. A common short-variant is a response that finishes at fifteen seconds with only the overview and one foreground detail. A common long-variant is a response that runs out of time before reaching Move 4 (inference), capping the score at level 3.

The remediation is to plan for the target time budget. A target of approximately ten seconds per move (four moves × ten seconds = forty seconds) produces a response in the middle of the target band and leaves room for adjustment based on picture complexity.
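
The budget arithmetic can be checked in a few lines. This is a minimal sketch under the figures given in this guide (a thirty to forty-five second window and roughly ten seconds per move); the names `RESPONSE_WINDOW`, `even_budget`, and `slack` are hypothetical.

```python
RESPONSE_WINDOW = (30, 45)  # seconds, the range stated in this guide
MOVES = ["overview", "foreground", "background", "inference"]


def even_budget(seconds_per_move: int = 10) -> dict[str, int]:
    """Give every move the same number of seconds."""
    return {move: seconds_per_move for move in MOVES}


def slack(budget: dict[str, int]) -> tuple[int, int]:
    """Return (seconds above the minimum, seconds below the maximum)."""
    total = sum(budget.values())
    low, high = RESPONSE_WINDOW
    return total - low, high - total


budget = even_budget()
print(sum(budget.values()))  # 40
print(slack(budget))         # (10, 5)
```

With the even split, the response sits ten seconds above the lower bound and five seconds below the upper bound, which is the room for adjustment mentioned above.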

How to practice the structure under time pressure

The four-move structure and the three detail-selection heuristics are easy to memorize but require practice to execute under the thirty-to-forty-five-second time pressure that Speaking Part 3 imposes. A useful practice routine has three phases.

Phase 1 — heuristic-selection drills. Look at a Part 3 picture and select the appropriate heuristic (A, B, or C) within five seconds, without speaking the response. The drill develops the heuristic-selection instinct that frees up speaking time for content generation. Practice twenty to thirty pictures per session.

Phase 2 — timed-move drills. Speak a complete response to a Part 3 picture with explicit timing for each move — five seconds for planning, ten seconds for Move 1, fifteen seconds for Move 2, ten seconds for Move 3, and five to ten seconds for Move 4. The drill develops the time-budgeting instinct that prevents time imbalance.

Phase 3 — full-condition simulation. Speak Part 3 responses under full TOEIC Link timing conditions, with no heuristic-selection time allocation and no per-move timing aids. The drill simulates exam conditions and surfaces the structural mistakes that emerge under pressure.
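
For the Phase 2 drill, it can help to convert the per-move durations into clock cues. The sketch below is an illustration only: the `DRILL` list and the `cue_times` function are hypothetical names, Move 4 is set to the upper end of its five-to-ten-second range, and planning is treated as a separate segment before speaking, which is an assumption rather than something the timing rules here state.

```python
from itertools import accumulate

# (label, duration in seconds) taken from the Phase 2 description
DRILL = [
    ("planning", 5),
    ("Move 1", 10),
    ("Move 2", 15),
    ("Move 3", 10),
    ("Move 4", 10),  # upper end of the 5-10 second range
]


def cue_times(segments):
    """Return (label, start_second, end_second) for each segment."""
    ends = list(accumulate(duration for _, duration in segments))
    starts = [0] + ends[:-1]
    return [(label, s, e) for (label, _), s, e in zip(segments, starts, ends)]


for label, start, end in cue_times(DRILL):
    print(f"{start:>2}s to {end:>2}s  {label}")

# Speaking time excludes the planning segment
speaking = sum(d for label, d in DRILL if label != "planning")
print(speaking)  # 45
```

With Move 4 at five seconds instead of ten, the speaking total drops to forty seconds, so the drill spans the upper part of the response window either way.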

Recording the Phase 3 responses and reviewing them against a level-3 / level-4 reference is the highest-leverage feedback loop. Candidates who skip the recording-and-review step typically plateau at level 2, because the structural mistakes that lower their score are not audible in the moment.

Common candidate questions

Q: Should I memorize a fixed opening sentence to use for every picture?

A: A semi-fixed opening template ("This is a photograph of [scene type], and [primary action]") is useful for the first few seconds of the response, but the template should be adapted to the specific picture. A response that opens with the same exact sentence regardless of the picture reads as scripted and is penalized for not engaging with the prompt.

Q: What if I cannot tell what the picture is depicting?

A: Use a hedged overview ("This appears to be a workplace setting, possibly an office or a co-working space") and let the rest of the response focus on observable details. Graders do not penalize uncertain overviews if the rest of the response is structurally coherent. They do penalize confident but inaccurate overviews ("This is a hospital" when the picture shows a hotel lobby).

Q: How many people should I describe?

A: Describe the primary figure or pair in detail (clothing, action, position) and reference the secondary figures in aggregate ("other employees," "additional customers"). A response that attempts to describe each of ten people individually runs out of time and reads as an inventory.

Q: Is it okay to use "I think" or "I believe" in the inference move?

A: Yes, but sparingly. "I think the meeting has just started" reads naturally for the inference move. Using "I think" for observable details ("I think there is a chair") reads as unconfident and is penalized. Reserve the hedging language for the inference move.

Cross-references and further reading

For Speaking Parts 4 through 6 (response and opinion tasks), see the speaking and writing tips guide. For the fluency dimensions that interact with descriptive structure under time pressure, see the speaking fluency and hesitation recovery guide. For pronunciation work that supports the picture-description vocabulary range, see the speaking pronunciation self-assessment guide. For the related listening discrimination skills, see the listening detail vs main idea discrimination guide.