How ATS parsers actually work (from someone who reversed one)

Most articles about ATS systems tell you what to do — use standard headers, avoid text boxes, spell out acronyms. This one explains why those rules exist, by going inside the machine.

What follows is a reconstruction of the ATS parsing pipeline based on published research, ATS vendor documentation, academic NLP literature, and the observable behaviour of real systems — Workday, Greenhouse, Lever, and Taleo. None of the system-specific claims are proprietary or invented. Where vendors have not published details, the behaviour described is the consensus from the academic parsing literature and from empirical testing with known inputs and observable outputs.

Parsing pipeline overview

An ATS does not read your CV. It runs your document through a multi-stage data extraction pipeline and then operates on the structured output. The pipeline has six stages: ingestion, normalisation, section detection, entity extraction, matching, and scoring.

Understanding each stage tells you exactly where your CV can fail — and why.

Stage 1 — Ingestion. The document file is received and converted to a canonical internal format, typically a Unicode string representation of the text content. For .docx files, this means parsing the Open XML structure and concatenating text runs from the document body, discarding content in headers, footers, text boxes, shapes, and drawing canvases. For PDFs, a text extraction library (commonly pdfminer or a commercial equivalent) attempts to reconstruct reading order from the glyph position data in the PDF's content stream. PDFs generated from design tools often fail here because they embed glyphs without reading-order metadata.
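As a rough illustration of the .docx half of ingestion, the sketch below reads only the main document part and skips text-box content. It is a minimal sketch using the Python standard library, not any vendor's implementation; the function name extract_body_text is hypothetical, and it deliberately ignores tables and other nested structures.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_body_text(docx_bytes: bytes) -> str:
    """Return top-level body paragraph text, skipping text-box content."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Headers and footers live in separate parts (word/header*.xml,
        # word/footer*.xml), so reading only document.xml leaves them out.
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{W}body")
    paragraphs = []
    for para in body.findall(f"{W}p"):
        # Text runs nested inside a text box are collected so they can be skipped.
        boxed = {t for box in para.iter(f"{W}txbxContent") for t in box.iter(f"{W}t")}
        runs = [t.text or "" for t in para.iter(f"{W}t") if t not in boxed]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Anything a candidate places in a header or a text box never reaches the string this function returns, which is the practical reason contact details belong in the document body.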

Stage 2 — Normalisation. The raw text string is normalised: Unicode is normalised to NFC form, soft hyphens are removed, ligatures are expanded (the "fi" ligature in some fonts becomes two characters: f and i), and whitespace is collapsed. Special characters that appear as garbled output from encoding errors are stripped or replaced. This stage is where non-standard bullet characters — arrows, chevrons, and custom Unicode symbols — sometimes disappear or become question marks.
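The normalisation steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the ligature table covers just the common Latin ligatures, and the choice to delete the Unicode replacement character (rather than substitute something for it) is an assumption.

```python
import re
import unicodedata

# Common Latin presentation-form ligatures and their expansions
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}
SOFT_HYPHEN = "\u00ad"
REPLACEMENT_CHAR = "\ufffd"  # emitted by decoders when bytes cannot be decoded


def normalise(raw: str) -> str:
    """Apply NFC, drop soft hyphens, expand ligatures, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = text.replace(SOFT_HYPHEN, "").replace(REPLACEMENT_CHAR, "")
    for ligature, expanded in LIGATURES.items():
        text = text.replace(ligature, expanded)
    # Collapse runs of spaces, tabs, and non-breaking spaces within each line,
    # but keep line breaks because later stages rely on them.
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)
```

Note that NFC alone does not expand ligatures, which is why the sketch handles them as a separate step.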

Stage 3 — Section detection. The normalised text is segmented into labelled sections: contact, summary, experience, education, skills, and miscellaneous. This is the most complex stage and the most common source of classification errors.

Stage 4 — Entity extraction. Within each classified section, the parser extracts structured entities: name and contact details from the contact section, company/title/date triples from experience, degree/institution/date triples from education, and skill terms from the skills section.

Stage 5 — Matching. The extracted entities are compared against the job requirements, which have themselves been parsed from the job description using a similar pipeline.

Stage 6 — Scoring. A match score is assembled from the entity comparisons and returned as a numeric value that determines where the application appears in the recruiter's queue.

Each stage depends on the previous one. A section detection failure in stage 3 means the entity extractor in stage 4 works on misclassified text, producing garbled output that generates a low match score in stage 6 — even if the candidate is a perfect fit.

Section boundary detection

Section detection is the stage most directly affected by how you format your CV. The parser needs to identify where one section ends and another begins, and classify what each section contains.

The dominant approach across all major ATS platforms is a combination of three signals.

Heading pattern matching. The parser maintains a vocabulary of known section headings: "Work Experience," "Employment History," "Professional Experience," "Career History" for the experience section; "Education," "Academic Background" for education; "Skills," "Core Competencies," "Technical Skills" for the skills section. Each heading in this vocabulary is associated with one or more section labels. When the parser encounters a line that matches a known heading (typically case-insensitive, with some fuzzy tolerance), it starts a new section with the matched label.

The fuzzy tolerance is usually based on edit distance: a heading up to about three character edits away from "Work Experience" may still be classified correctly. But "Where I've been" differs from every known heading by far more edits than any reasonable tolerance allows, so it will not match. The parser will then either treat the following text as a continuation of the previous section or label it "miscellaneous," which contributes nothing to entity extraction for experience.
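To make the vocabulary-plus-tolerance idea concrete, here is a minimal sketch. The vocabulary is the set of headings listed above; the names HEADING_VOCAB and classify_heading, and the fixed tolerance of three edits, are assumptions for illustration rather than any vendor's actual values.

```python
from typing import Optional

# Known headings mapped to section labels (taken from the examples in the text)
HEADING_VOCAB = {
    "work experience": "experience",
    "employment history": "experience",
    "professional experience": "experience",
    "career history": "experience",
    "education": "education",
    "academic background": "education",
    "skills": "skills",
    "core competencies": "skills",
    "technical skills": "skills",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def classify_heading(line: str, max_distance: int = 3) -> Optional[str]:
    """Return the label of the closest known heading, or None if none is close."""
    candidate = line.strip().lower().rstrip(":")
    best_label, best_distance = None, max_distance + 1
    for heading, label in HEADING_VOCAB.items():
        distance = edit_distance(candidate, heading)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```

With these settings, classify_heading("WORK EXPERIANCE") returns "experience" and classify_heading("Where I've been") returns None. A fixed tolerance is crude for short headings such as "Skills"; a real system would more plausibly scale it with heading length.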

Whitespace and layout heuristics. After heading matching, the parser applies layout heuristics: a line that is significantly shorter than surrounding lines, followed by a blank line, and followed by longer lines is a candidate heading even if it does not match the vocabulary. This is why a blank line above and below your section headers matters — it makes the layout signal unambiguous. A continuous block of text with no whitespace separations makes it harder for the parser to identify section transitions.
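The layout signal can be approximated with a simple rule over lines of text. The sketch below is one possible reading of that heuristic: the function name candidate_headings and the 0.5 length ratio are hypothetical choices, not documented parser parameters.

```python
def candidate_headings(lines: list[str], ratio: float = 0.5) -> list[int]:
    """Return indices of lines that look like headings from layout alone.

    A line qualifies when it is followed by a blank line and then by a
    content line at least 1/ratio times as long as the candidate itself.
    """
    hits = []
    for i, line in enumerate(lines):
        text = line.strip()
        if not text or i + 2 >= len(lines):
            continue
        next_is_blank = not lines[i + 1].strip()
        following = lines[i + 2].strip()
        if next_is_blank and following and len(text) <= ratio * len(following):
            hits.append(i)
    return hits
```

If the blank line after the heading is removed, the candidate no longer qualifies under this rule, which mirrors why a wall of text hides section transitions.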

Font and formatting signals (in .docx). When parsing Word documents, the parser can access paragraph style and character formatting metadata from the Open XML structure. Text given the Heading 1 or Heading 2 paragraph style in Word is a strong heading signal that does not depend on vocabulary matching at all. This is an underused advantage of the .docx format: apply Word's built-in heading styles to your section headers, and the parser receives an unambiguous section-boundary signal regardless of what the heading text says. Which label the section receives still depends on the heading text itself.
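The style signal is visible directly in the document XML. The sketch below looks for paragraphs whose style ID starts with "Heading"; that prefix is what English-language Word typically writes for its built-in heading styles, but it is an assumption here, since style IDs can differ in localised or customised documents. The function name styled_headings is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def styled_headings(document_xml: str) -> list[tuple[str, str]]:
    """Return (style_id, text) for paragraphs that carry a heading style."""
    root = ET.fromstring(document_xml)
    found = []
    for para in root.iter(f"{W}p"):
        # The paragraph style reference sits at w:pPr/w:pStyle/@w:val
        style = para.find(f"{W}pPr/{W}pStyle")
        if style is None:
            continue
        style_id = style.get(f"{W}val", "")
        if style_id.lower().startswith("heading"):
            text = "".join(t.text or "" for t in para.iter(f"{W}t"))
            found.append((style_id, text))
    return found
```

A more robust version would resolve each style ID against the names in word/styles.xml instead of relying on the ID prefix.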

The practical consequence of how section detection works is that misclassification is the primary failure mode for unusual CV formats. A candidate who puts their skills section before their work experience, with non-standard heading text and no whitespace separation, risks having both sections classified as miscellaneous. Neither will contribute to the match score.

Entity extraction

Once sections are classified, the parser extracts structured entities from the text within each section. Each entity type has its own extraction logic, and each has characteristic failure modes.

Name extraction. The contact section is typically the first block of text in the document. The parser expects a name to appear in the first few lines, formatted as a short run of capitalised words (typically two, sometimes more). Names are identified using a combination of positional heuristics and a name gazetteer, which is a list of known first names and surnames. This works well for common Western names but can misclassify unusual names or names from cultures with different ordering conventions. Workaround: ensure your name is on the very first line, isolated from other content.

Date extraction. Date parsing is where entity extraction most frequently fails, and it is the most consequential failure because dates affect tenure calculation, which affects seniority scoring.

The parser applies a set of date patterns (regular expressions for formats like "January 2022," "Jan 2022," "01/2022," and "2022-01") and attempts to match them within the context of experience entries. The failure modes are:

Year-only dates. "2020 – 2022" is ambiguous: does the candidate mean Jan 2020 to Dec 2022 (about 36 months) or Dec 2020 to Jan 2022 (about 13 months)? Different parsers resolve this differently. Each missing month adds nearly a year of uncertainty to the range, which can cause under-counting or over-counting of tenure.
Overlapping dates from promotions. If you stayed at one company through a promotion and list both roles under the same employer, the parser may have difficulty attributing the correct date range to each title. Separate promoted roles into distinct entries with their own date ranges.
"Ongoing" and "current" synonyms. "Present," "Now," "Current," "Ongoing," and "—" (dash with no end date) are all used to indicate current employment. Most parsers handle "Present" and "Current" reliably. "Now," "Ongoing," and bare dashes are less consistently supported.
Non-standard date separators. Some parsers expect a hyphen ("-"), others an en dash ("–") or an em dash ("—"), and some will not parse "/" as a date range separator. The safest separator across all systems is " - " (space, hyphen, space) or the word "to".
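The year-only ambiguity is easy to see once date ranges are turned into numbers. The following sketch parses one date range with a regular expression and returns the shortest and longest tenure consistent with it, counted as elapsed months from start month to end month. The pattern, the accepted separators, and the function name tenure_bounds are assumptions for illustration; real parsers support many more formats.

```python
import re
from typing import Optional

MONTH_NAMES = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_RE = (r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
            r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?")


def _date(prefix: str) -> str:
    # Optional month name followed by a four-digit year
    return (rf"(?:\b(?P<{prefix}_mon>{MONTH_RE})\.?\s+)?"
            rf"(?P<{prefix}_year>(?:19|20)\d{{2}})")


RANGE_RE = re.compile(
    _date("s") + r"\s*(?:-|\u2013|\u2014|\bto\b)\s*"
    + r"(?:" + _date("e") + r"|(?P<open>present|current))",
    re.IGNORECASE,
)


def tenure_bounds(text: str, today: tuple[int, int]) -> Optional[tuple[int, int]]:
    """Return (min_months, max_months) for the first date range found in text."""
    m = RANGE_RE.search(text)
    if m is None:
        return None

    def month(name: Optional[str]) -> Optional[int]:
        return MONTH_NAMES.index(name[:3].lower()) + 1 if name else None

    s_year, s_mon = int(m["s_year"]), month(m["s_mon"])
    if m["open"]:
        e_year, e_mon = today  # open-ended range runs to the supplied (year, month)
    else:
        e_year, e_mon = int(m["e_year"]), month(m["e_mon"])

    def span(start_month: int, end_month: int) -> int:
        return (e_year - s_year) * 12 + (end_month - start_month)

    # A missing month could be anything from January (1) to December (12).
    shortest = span(s_mon or 12, e_mon or 1)
    longest = span(s_mon or 1, e_mon or 12)
    return max(shortest, 0), longest
```

For "Jan 2022 - Mar 2024" both bounds are 26 months. For "2020 – 2022" the bounds are 13 and 35 months, so a parser's default for the missing months can swing the computed tenure by almost two years.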

Title and company extraction. Job titles and company names are extracted as a pair within each experience entry. The parser expects to find them together, typically on the first line of the entry, before the date and the bullet points. Separation is inferred by position: the title is usually first, the company second, or vice versa depending on the CV style. Some parsers use a company gazetteer (a list of known company names) to improve accuracy; others rely purely on position.

The failure mode here is mixing title, company, and date onto a single line with inconsistent separators. "Senior Engineer @ ACME Corp / Jan 2022 – Mar 2024" uses two different separators and may confuse the parser about which token is the title and which is the company. The safest format keeps each element clearly delimited: title and company on one line, separated by a pipe or comma, with the date range at the end of that line or on the next.

Skill extraction. Skills are extracted from the skills section using a combination of a skill term gazetteer (a vocabulary of known technologies, tools, and competencies) and keyword frequency analysis. The gazetteer approach means that skills not in the vocabulary may not be recognised. This is one reason why unusual acronyms and proprietary tool names sometimes do not appear in ATS keyword reports even when clearly listed in your CV.
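A gazetteer lookup is little more than a set-membership test over the section text, which is why out-of-vocabulary terms vanish. In the sketch below the six-term SKILL_GAZETTEER and the function name extract_skills are hypothetical; production vocabularies are far larger.

```python
import re

# Tiny illustrative vocabulary; a real gazetteer holds thousands of terms
SKILL_GAZETTEER = {"python", "sql", "aws", "kubernetes", "machine learning", "ci/cd"}


def extract_skills(section_text: str, gazetteer: set[str] = SKILL_GAZETTEER) -> set[str]:
    """Return the gazetteer terms that appear as whole terms in the text."""
    text = section_text.lower()
    found = set()
    for term in gazetteer:
        # Lookarounds instead of \b so terms containing "/" still anchor
        # cleanly and "sql" does not match inside "mysql".
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found.add(term)
    return found
```

Given "Skills: Python, SQL, Kubernetes, FooBarDB, CI/CD", where FooBarDB stands in for a proprietary tool name, the function returns every listed skill except FooBarDB.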

Keyword matching — exact, stemmed, and semantic

Once entities are extracted, the matching stage compares them against the job requirements. There are three tiers of matching, and different ATS platforms use different combinations.

Exact matching. The simplest approach: a keyword from the job description appears verbatim in the CV. "Python" in the JD matches "Python" in the CV. Case-insensitive. This is the baseline that all ATS systems implement.

Stemmed matching. The parser applies a stemming algorithm, most commonly a variant of the Porter or Snowball stemmer, before matching. Stemming reduces words to a root form: "managed" and "managing" both stem to "manag," and "optimised" and "optimising" both stem to "optimis." This means writing "manage" when the JD says "managing" will still match under stemming. Most modern ATS platforms implement stemmed matching, which is why variation in word form (tense, plural, participle) does not usually matter.

Stemming does not handle synonyms. "Led" and "managed" have different stems and will not match each other under a stemmer alone. If the JD says "led cross-functional teams" and your CV says "managed cross-functional teams," that specific phrase will not match on the stemmed pass.
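The difference between word-form variation and synonyms shows up clearly in code. The stemmer below is a deliberately crude suffix stripper, not the Porter or Snowball algorithm, and the names toy_stem and stemmed_overlap are hypothetical; it exists only to show the shape of a stemmed matching pass.

```python
import re

# Suffixes tried in order; a real stemmer has many more rules and conditions
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def stemmed_overlap(jd_text: str, cv_text: str) -> set[str]:
    """Return the stems that the job description and the CV share."""
    jd_stems = {toy_stem(token) for token in _tokens(jd_text)}
    cv_stems = {toy_stem(token) for token in _tokens(cv_text)}
    return jd_stems & cv_stems
```

Here "managing", "managed", and "manage" all reduce to "manag" and therefore match one another, while "led" stays "led" and shares nothing with "managed".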

Semantic matching. Some ATS platforms — particularly Greenhouse and Workday in their more recent versions — supplement exact and stemmed matching with a semantic similarity layer. This is typically implemented using pre-trained word embeddings or a fine-tuned sentence transformer model. The semantic layer can match "managed" to "led," "Python" to "Python 3," and "machine learning" to "ML" with a cosine similarity score above a threshold.

The threshold matters. At a high similarity threshold (above 0.85 cosine similarity), only near-synonyms match. At a lower threshold (0.70–0.75), more distant synonyms match but false positives increase. Most platforms tune to the higher end for precision. The practical implication: do not rely on semantic matching to bridge large conceptual gaps. "Led initiatives" will not match "increased revenue by 40%" semantically, because the two phrases sit far apart in embedding space and their similarity falls well below either threshold.
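The thresholding step itself is simple arithmetic over vectors. In the sketch below the three-dimensional TOY_VECTORS are made up purely for illustration (real embeddings come from a trained model and have hundreds of dimensions), and semantic_match is a hypothetical helper, so only the mechanics carry over, not the numbers.

```python
import math

# Made-up vectors for illustration only; not the output of any real model
TOY_VECTORS = {
    "managed": (0.9, 0.1, 0.3),
    "led": (0.8, 0.2, 0.4),
    "revenue": (0.1, 0.9, 0.2),
}


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_match(term_a: str, term_b: str, threshold: float = 0.85) -> bool:
    """True when the two terms' vectors are at least `threshold` similar."""
    return cosine(TOY_VECTORS[term_a], TOY_VECTORS[term_b]) >= threshold
```

With these toy vectors, "managed" and "led" clear the 0.85 threshold, while "led" and "revenue" fail even at 0.70.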

What this means in practice. The safest approach is to write your CV in the language of the job description. Where the JD says "product roadmap," use "product roadmap" — not "feature planning" or "development roadmap." Where the JD says "SQL," use "SQL" — not just "databases." The keyword matching layer is not reading for meaning; it is looking for surface-level tokens.

The table below summarises when each matching type applies:

Matching type	When it fires	Example
Exact	Always	JD: "Python" → CV: "Python"
Stemmed	Most systems	JD: "managing" → CV: "managed"
Semantic	Some platforms, varies by version	JD: "led" → CV: "managed"
No match	Phrases too distant for any tier	JD: "roadmap planning" → CV: "feature planning"

Score composition

The match score is not a simple keyword count. It is a weighted sum of multiple signals, and the weighting model differs across platforms.

Required vs. preferred keywords. Job descriptions in most ATS platforms are parsed into two categories of requirements: required (or "must-have") and preferred. Required keywords receive significantly higher weight in the scoring formula. A candidate who matches all preferred keywords but none of the required keywords will score lower than a candidate who matches all required keywords and none of the preferred ones.

You cannot always tell from a public job posting which keywords are marked required internally. The safest assumption is that skills listed under "Requirements" or "Must have" in the JD are required, and skills under "Nice to have" or "Preferred" are weighted lower.

Keyword frequency signals. Some ATS platforms weight keywords by their frequency in the JD. A skill mentioned three times (in the job title, the requirements list, and the responsibilities section) is treated as more important than a skill mentioned once. Writing that skill into your CV once may not be sufficient. Aim for three natural, non-stuffed mentions: one in your summary, one in your skills list, and one in a bullet point.
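One plausible way to turn JD frequency into weights is to normalise raw counts so they sum to one. This is a sketch under that assumption; the function name keyword_weights is hypothetical, and vendors may use other schemes entirely.

```python
import re
from collections import Counter


def keyword_weights(jd_text: str, keywords: list[str]) -> dict[str, float]:
    """Weight each keyword by its share of all keyword mentions in the JD."""
    text = jd_text.lower()
    counts: Counter = Counter()
    for keyword in keywords:
        # Whole-term match so "sql" is not counted inside "mysql"
        pattern = r"(?<![a-z0-9])" + re.escape(keyword.lower()) + r"(?![a-z0-9])"
        counts[keyword] = len(re.findall(pattern, text))
    total = sum(counts.values())
    return {kw: (counts[kw] / total if total else 0.0) for kw in keywords}
```

For a JD that mentions SQL three times and Python once, SQL receives a weight of 0.75 and Python 0.25.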

Seniority and tenure. Separate from keyword matching, most ATS systems score candidates on seniority signals: total years of experience, years in relevant roles, and the career progression suggested by title changes. These scores combine with keyword match scores in the final ranking formula. A strong keyword match with very low tenure for a senior role will score below a moderate keyword match with the expected tenure.

Title match. Some platforms give specific weight to candidates whose previous job titles closely match either the target role title or titles listed in the requirements. A candidate whose most recent title is "Senior Software Engineer" applying to a "Senior Software Engineer" role gets a title match bonus. This is distinct from keyword matching — it operates on the extracted title entity, not the full text.

The final score. The numeric score returned by the ATS is proprietary to each vendor. Workday and Taleo do not publish their scoring formulas. What is publicly known is that the score is a function of keyword match, seniority signals, and title match, combined via a weighted formula that recruiter administrators can configure per-role. This configurability means that the same CV can score differently for the same job at two companies using the same ATS vendor, if their administrators have set different keyword weights.
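Since the real formulas are unpublished, the following is only a sketch of what a configurable weighted sum could look like. Every number in Weights is invented for illustration, and the names Weights, coverage, and match_score are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical per-role weights an administrator might configure
    required: float = 0.5
    preferred: float = 0.15
    tenure: float = 0.2
    title: float = 0.15


def coverage(wanted: set[str], found: set[str]) -> float:
    """Fraction of wanted keywords present in the CV (1.0 if none are wanted)."""
    return len(wanted & found) / len(wanted) if wanted else 1.0


def match_score(cv_skills: set[str], required: set[str], preferred: set[str],
                years: float, expected_years: float, title_match: bool,
                weights: Weights = Weights()) -> float:
    """Combine keyword, tenure, and title signals into a 0-100 score."""
    tenure = min(years / expected_years, 1.0) if expected_years > 0 else 1.0
    score = (weights.required * coverage(required, cv_skills)
             + weights.preferred * coverage(preferred, cv_skills)
             + weights.tenure * tenure
             + weights.title * (1.0 if title_match else 0.0))
    return round(100 * score, 1)
```

Under the default weights, a candidate who matches every required keyword and no preferred ones outscores a candidate with the reverse profile, and passing a different Weights instance changes the score for the same CV, which is the configurability effect described above.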

What this means for how you write your CV

The parsing pipeline above has direct, actionable implications for how you write each section of your CV.

Use the job description's vocabulary. The matching layer operates on tokens, not meaning. Mirror the exact language of the JD for your key skills and competencies. If the JD says "cross-functional collaboration," use that phrase — not "working across teams" or "multi-team coordination."

Structure your experience entries consistently. The entity extractor expects consistent structure: title, company, date range, then bullets. Any deviation from this pattern risks misattributing the data. Use the same format for every role.

Make section headers unambiguous. Use headings from the known vocabulary: "Work Experience," "Education," "Skills," "Professional Summary." Apply Word's heading styles if submitting a .docx. Add whitespace above and below headers.

Spell out acronyms on first use. The gazetteer matching for skills will typically handle well-known acronyms like "SQL," "AWS," and "ML." Less common acronyms may not be in the vocabulary. "CI/CD (continuous integration and continuous deployment)" covers both the acronym form that experts use in JDs and the expanded form that some parsers match.

Use month-and-year dates throughout. Year-only dates introduce uncertainty in tenure calculation. Be consistent: if you use "Jan 2022," use that format for every date in the document.

Do not rely on formatting to communicate structure. Bold text, font size, and colour do not create reliable section boundaries for a parser. Whitespace and heading vocabulary do. Your CV should make structural sense in a plain-text version with all formatting stripped.

The ATS CV checker surfaces most of the extraction and matching failures described here — keyword gaps, missing required terms, and structural issues — before you submit. For the baseline formatting rules, the ATS resume format guide covers the ten rules that affect parse quality in practical terms. And for the JD keyword extraction step specifically, how to find ATS keywords in any job description is the companion piece to this one.

For the full picture of how ATS systems fit into the recruiter workflow and what to do before and after submission, the complete ATS resume guide covers the end-to-end process.

Frequently asked questions

Do all ATS platforms use the same parsing approach?

No. The core pipeline — ingestion, normalisation, section detection, entity extraction, matching, scoring — is common across major platforms, but the implementation details differ significantly. Workday uses a proprietary parsing engine built in-house; Greenhouse and Lever both use third-party parsing services with their own matching layers on top. Taleo (Oracle) has legacy architecture that is notably less tolerant of unconventional formatting than the newer platforms. The rules in this post represent the common denominator across all of them.

Does semantic matching mean I can use synonyms freely?

Only partially. Semantic matching is present in some ATS platforms but not all, and when it is present the similarity threshold is typically set high enough that only close synonyms match. "Led" and "managed" will usually match semantically. "Drove revenue growth" and "increased sales" may or may not match, depending on the platform and configuration. The safest approach remains mirroring the JD's vocabulary for key terms, while using natural variation elsewhere to avoid keyword stuffing.

Can a recruiter override the ATS score?

Yes. The ATS score determines the order in which applications appear in the recruiter's queue, but a recruiter can always manually review any application regardless of score. The score is a filter for efficiency, not an absolute gate. However, applications with very low scores are often never seen — there is simply not enough time for a recruiter handling hundreds of applications to manually review every low-scoring submission. The practical effect is that a low ATS score significantly reduces your probability of human review, even if it does not make it zero.

Does the file format really matter that much?

Yes, but the risk is asymmetric. A well-structured .docx file from Word almost never causes ingestion or normalisation failures. A well-structured PDF from Word is nearly as reliable. A PDF from a design tool like Canva or a visually complex template from a CV builder often fails ingestion entirely — the text layer is either absent or has corrupted reading order. The safe default is .docx for ATS submissions and PDF for direct recruiter contact where a human will read it.