Sharing a Translation: Paraphrase Corpus Research
I translated this paper for my company's reference and am pasting it here to share. It is Microsoft's 2005 description of how they built a corpus of semantically equivalent sentences, useful for semantic duplicate detection and paraphrase identification. The English original is at the bottom.
This document briefly describes how the team created its paraphrase corpus and the results of the annotation effort. The corpus may be expanded further in the future.
The earlier related publications are:
Quirk, C., C. Brockett, and W. B. Dolan. 2004. Monolingual Machine Translation for Paraphrase Generation. In Proceedings of EMNLP 2004, Barcelona, Spain.
Dolan, W. B., C. Quirk, and C. Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. COLING 2004, Geneva, Switzerland.
1. Introduction to the paraphrase tagging task
The dataset contains 5801 pairs of matched sentences, gleaned over 18 months from thousands of news sources on the web. Each pair carries a human annotation: several judges assessed whether the two sentences are close enough in meaning.
Rating scheme: each pair was examined by two judges, who gave a binary judgment on whether the two sentences are "semantically equivalent"; when the two judges disagreed on a pair, it was passed to a third judge for a final decision.
Result: after resolving disagreements between raters, 3900 pairs were judged "semantically equivalent", 67% of the original 5801.
In practice, many pairs rated "semantically equivalent" still diverge in meaning to some degree. If we call a pair of fully interchangeable sentences "bidirectionally entailing", then most pairs in this corpus are only "mostly bidirectionally entailing": one sentence carries information that differs slightly from, or is missing from, the other.
The detailed rating criteria appear in Section 3. By and large, though, equivalence came down to the rater's intuition: does a given mismatch alter too much of the original meaning? Are the two sentences "saying the same thing"?
The task was ill-defined enough that we were surprised to find interrater agreement averaging as high as 83%.
A series of experiments aimed at making the rating task more concrete uniformly lowered that agreement.
For instance, we added a rating form asking judges whether one sentence entailed the other, and whether particular changes of construction should count against equivalence. These finer-grained questions mostly produced disagreement, but firm guidelines of this kind did help in a few situations, such as anaphora; see Section 3.
Our decision to tag sentences as "more or less semantically equivalent" rather than strictly "semantically equivalent" was a practical one: insisting on strict equivalence would have left only the most trivial paraphrases, such as pairs differing only in a single word like "Mr." or "Ms.". Our goal was to enable machines to recognize more complex paraphrase relationships, which requires a looser definition of "semantic equivalence". We therefore restricted the corpus to pairs with a minimum word-based Levenshtein distance of ≥ 8 (an edit distance measuring how much two word sequences differ).
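As an illustration (not code from the original paper), here is a minimal sketch of such a filter in Python. It assumes naive whitespace tokenization, and the function names are ours:

```python
def word_levenshtein(s1: str, s2: str) -> int:
    """Edit distance between two sentences, counted in words, not characters."""
    a, b = s1.split(), s2.split()  # naive whitespace tokenization (an assumption)
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr[j] = min(prev[j] + 1,         # delete a word
                          curr[j - 1] + 1,     # insert a word
                          prev[j - 1] + cost)  # substitute a word
        prev = curr
    return prev[-1]

def keep_pair(s1: str, s2: str, min_distance: int = 8) -> bool:
    """Mirrors the corpus restriction: keep only pairs that differ enough."""
    return word_levenshtein(s1, s2) >= min_distance
```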
Under this relatively loose standard of equivalence, any two of the following sentences could be considered "paraphrases", despite obvious differences in their information content:
· The genome of the fungal pathogen that causes Sudden Oak Death has been sequenced by US scientists
· Researchers announced Thursday they've completed the genetic blueprint of the blight-causing culprit responsible for sudden oak death
· Scientists have figured out the complete genetic code of a virulent pathogen that has killed tens of thousands of California native oaks
· The East Bay-based Joint Genome Institute said Thursday it has unraveled the genetic blueprint for the diseases that cause the sudden death of oak trees
Raters saw sentences in which several classes of named entities had been replaced by generic tags, so that "Tuesday" became %%DAY%%, "$10,000" became %%MONEY%%, and so on (the released corpus preserves the original strings).
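The substitution can be pictured as a small rule table. The regular expressions below are guesses for illustration only; the paper does not describe the entity recognizer actually used:

```python
import re

# Illustrative patterns only; the authors' actual entity recognizer is not described.
GENERIC_TAGS = [
    (re.compile(r"\b(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b"), "%%DAY%%"),
    (re.compile(r"\$\d[\d,]*(\.\d+)?"), "%%MONEY%%"),    # must run before the number rule
    (re.compile(r"\b\d[\d,]*(\.\d+)?\b"), "%%NUMBER%%"),
]

def anonymize(sentence: str) -> str:
    """Replace a few classes of named entities with generic tags."""
    for pattern, tag in GENERIC_TAGS:
        sentence = pattern.sub(tag, sentence)
    return sentence

print(anonymize("The device costs $10,000 and ships Tuesday."))
# -> The device costs %%MONEY%% and ships %%DAY%%.
```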
Note that many pairs judged "not equivalent" still overlap substantially in information content and even in wording. We used a series of automatic filtering techniques to build an initial dataset rich in paraphrase relationships, and these filters worked well enough that roughly 70% of the pairs the raters saw were semantically equivalent. The remaining 30% span a range of relationships, from completely unrelated, to partially overlapping, to almost-but-not-quite equivalent. For this reason, the pairs tagged "not equivalent" should not be used as negative training data.
2. Methodology and results
The dataset consists of 5801 sentence pairs, each with a binary human judgment of whether the pair constitutes a paraphrase.
2.1. Methodology
Rater 1 scored all 5801 pairs, Rater 2 scored 3533 pairs, and Rater 3 scored 2268 pairs.
Where Raters 1 and 2 disagreed, Rater 3 gave the final judgment; Rater 2 gave the final judgment on pairs where Raters 1 and 3 disagreed.
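The adjudication protocol is simple enough to state as code. This sketch captures only the logic, and the names are ours, not the paper's:

```python
from typing import Callable

def resolve(primary: bool, secondary: bool, adjudicate: Callable[[], bool]) -> bool:
    """Two raters judge every pair; a third is consulted only on disagreement."""
    if primary == secondary:
        return primary
    return adjudicate()  # the remaining rater casts the deciding vote

# e.g. final = resolve(rater1_says, rater2_says, lambda: rater3_judges(pair))
```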
After resolving disagreements, 3900 of the 5801 pairs were judged valid paraphrases, a final rate of 67.23%.
2.2. Interrater agreement
Agreement between raters was measured as a simple percentage of matching judgments:

               Total scored    Total agreements    Percentage agreement
Raters 1 & 2       3533              2904                 82.20
Raters 1 & 3       2268              1921                 84.70

2.3. Overall scoring results
Raw per-rater results, before disagreements were resolved:

           Total scored    Number "yes"    Percentage "yes"
Rater 1        5801            3601             62.08
Rater 2        3533            2589             73.28
Rater 3        2268            1612             71.08
2.4. Test/training
We assigned each pair a random sequence ID, sorted by it, and took the first 30% of the data as "training" and the remaining 70% as "test" data. For technical reasons the final split is slightly inexact: about 29.7% (1725 pairs) vs. 70.3% (4076 pairs).
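A minimal sketch of such a split; the fixed seed is our own assumption for reproducibility, and, as noted above, the exact counts in the released corpus differ slightly:

```python
import random

def split_pairs(pairs: list, train_fraction: float = 0.30, seed: int = 0):
    """Assign each pair a random sequence ID, sort by it, then cut 30/70."""
    rng = random.Random(seed)  # fixed seed: our assumption, for reproducibility
    shuffled = sorted(pairs, key=lambda _: rng.random())  # random sequence IDs
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]  # (train, test)
```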
3. Detailed tagging guidelines
3.1. "Equivalent" vs. "not equivalent" content
· Our task is to judge whether two sentences express the same content.
· Paraphrase is generally realized through alternative but similar syntactic constructions, substituted lexical items, and the like.
· Overall, the bar for "expressing the same content" should be set relatively high, meaning that ambiguous cases should be marked "not equivalent" rather than "equivalent".
Two examples of pairs whose "equivalent" content is expressed through alternative wording:
The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
American intelligence leading up to the war on Iraq will be criticised by a powerful US Congressional committee due to report soon, officials said today.
A strong geomagnetic storm was expected to hit Earth today with the potential to affect electrical grids and satellite communications.
A strong geomagnetic storm is expected to hit Earth sometime %%DAY%% and could knock out electrical grids and satellite communications.
The wording differs, but the content is the same. Pairs like these should be tagged "equivalent".
3.2. "Equivalent" sentence pairs with minor differences in content
Minor differences between sentences can be overlooked when deciding whether two sentences are paraphrases. For example:
An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%%.
An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by "strangulation/asphyxiation," Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.
The following pair likewise expresses the same content and can be considered equivalent:
Mr. Concannon had been doused in petrol, set himself alight and jumped onto a bike to leap eight metres onto a mattress below.
A SYDNEY man suffered serious burns after setting himself alight before attempting to jump a BMX bike off a toilet block into a pile of mattresses , police said.
The two sentences agree in their main content but differ in additional, modifying material. As long as the core meaning matches, we allow a certain amount of mismatched detail.
3.3. Anaphora
Sometimes the difference between two sentences involves anaphora: a full noun phrase in one sentence corresponding to a pronoun or other reduced form in the other.
Such pairs can be tagged as paraphrases even though they (sometimes) look quite different. Examples:
3.3.1. Demonstratives
But Secretary of State Colin Powell brushed off this possibility %%day%%.
Secretary of State Colin Powell last week ruled out a non-aggression treaty.
3.3.2. NP -> pronoun
Meteorologists predicted the storm would become a category %%number%% hurricane before landfall.
It was predicted to become a category 1 hurricane overnight.
3.3.3. Proper NP (animate) -> pronoun
Earlier, he told France Inter-Radio , ''I think we can now qualify what is happening as a genuine epidemic.''
''I think we can now qualify what is happening as a genuine epidemic,'' health minister Jean-Francois Mattei said on France Inter Radio.
3.3.4. Title + proper NP (animate) -> pronoun
''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement, '' chief financial officer Jake Brace said in a statement.
''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement,'' he said.
3.3.5. NP (inanimate) -> pronoun
''Spoofing is a problem faced by any company with a trusted domain name that uses e-mail to communicate with its customers.
It is a problem for Amazon and others that have a trusted domain name and use e-mail to communicate with customers.
3.4. Inherent ambiguity of the task
The relatively holistic criteria above work for most sentence pairs. What we end up tagging is "semantic near-equivalence": pairs that would ideally entail each other in both directions but in fact contain some mismatches. The question is where "same" ends and "different" begins, and how much difference makes a pair "not equivalent"; that remains a personal judgment call.
3.5. Sentence pairs with "different" content
3.5.1. "Different" content: a prototypical example
In contrast to the examples above, the following sentences clearly express different content:
Prime Minister Junichiro Koizumi did not have to dissolve parliament until next summer , when elections for the upper house are also due .
Prime Minister Junichiro Koizumi has urged Nakasone to give up his seat in accordance with the new age rule .
Although both sentences share the same agent (Koizumi), their predicates (dissolve vs. urge) and other arguments (parliament vs. Nakasone) clearly differ. This pair is marked "not equivalent": the two sentences describe unrelated events.
3.5.2. Same event and overlapping content, but one sentence contains information the other lacks (one is a superset of the other)
Researchers have identified a genetic pilot light for puberty in both mice and humans .
The discovery of a gene that appears to be a key regulator of puberty in humans and mice could lead to new infertility treatments and contraceptives.
The two sentences are similar in content and share the key piece of information, yet they are still "not equivalent", because one is a significantly larger superset of the other: everything in the first sentence is in the second, but not vice versa. The superset sentence also carries important information the first lacks (the possible new infertility treatments and contraceptives).
Pairs like this are hard to call, because superset and subset agree closely in meaning and come near to being paraphrases. The problem is that the superset contains key information the subset lacks, and that important extra content is what makes them non-equivalent.
When judging, some smaller differences can be ignored, such as slightly different adverbials or reduced name forms, as in:
An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%% .
An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by " strangulation/asphyxiation, " Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.
The weight of a content asymmetry also depends on sentence length. In a pair of 20-word sentences, one word more or less may make little difference, while in a pair of 5-word sentences a missing word can matter a great deal.
3.5.3. Cannot determine whether the sentences refer to the same event
More than %%NUMBER%% acres burned and more than %%NUMBER%% homes were destroyed in the massive Cedar Fire .
Major fires had burned %%NUMBER%% acres by early last night.
In this example, both sentences could be about the same fires, but they may equally describe two different things: one a single specific fire, the other several fires burning that night. Pairs like this should be marked "not equivalent".
3.5.4. Same content, different rhetorical structure
The search feature works with around %%NUMBER%% titles from %%NUMBER%% publishers, which translates into some %%NUMBER%% million pages of searchable text .
This innovative search feature lets Amazon customers search the full text of a title to find a book , supplementing the existing search by author or title .
Both sentences describe the same new search feature, but the first emphasizes the amount of data. Although they address the same subject, they differ significantly: the first reads like a detailed elaboration of the second. So this pair, too, is "not equivalent".
3.5.5. Same event, different details and emphasis
A Hunter Valley woman sentenced to %%NUMBER%% years jail for killing her four babies was only a danger to children in her care, a court was told.
As she stood up yesterday to receive a sentence of %%NUMBER%% years for killing her four babies, Kathleen Folbigg showed no emotion.
Both sentences report the same event, but the first emphasizes a particular legal argument, namely that the convicted woman was a danger only to children in her care, while the second describes her showing no emotion in court. The pair differs in both emphasis and detail, so it is tagged "not equivalent".
Original text:
Microsoft Research Paraphrase Corpus
Bill Dolan, Chris Brockett, and Chris Quirk
Microsoft Research
March 2, 2005
This document provides some information about the creation of the corpus, along with results of the annotation effort. If you use the corpus in your research, we would appreciate your citing one or both of the following papers, which give some details of our work on paraphrase and our data annotation efforts. (A paper describing in detail how this corpus was created is currently in progress.) We are continuing to tag data, and hope to release a larger version of this corpus to the research community in the future.
Quirk, C., C. Brockett, and W. B. Dolan. 2004. Monolingual Machine Translation for Paraphrase Generation, In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona Spain.
Dolan W. B., C. Quirk, and C. Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. COLING 2004, Geneva, Switzerland.
1. Introduction to the paraphrase tagging task
This dataset consists of 5801 pairs of sentences gleaned over a period of 18 months from thousands of news sources on the web. Accompanying each pair is a judgment reflecting whether multiple human annotators considered the two sentences to be close enough in meaning to be considered close paraphrases.
Each pair of sentences has been examined by 2 human judges who were asked to give a binary judgment as to whether the two sentences could be considered “semantically equivalent”. Disagreements were resolved by a 3rd judge. This annotation task was carried out by an independent company, the Butler Hill Group, LLC. Mo Corston-Oliver directed the effort, with Jeff Stevenson, Amy Muia, and David Rojas acting as raters. Mo Corston-Oliver and Jeff Stevenson also helped with the preparation of this document.
After resolving differences between raters, 3900 (67%) of the original 5801 pairs were judged “semantically equivalent”.
In many instances, the pair of sentences rated by 2 judges as “semantically equivalent” will in fact diverge semantically to at least some degree. If a full paraphrase relationship can be described as “bidirectional entailment”, then the majority of the “equivalent” pairs in this dataset exhibit “mostly bidirectional entailments”, with one sentence containing information that differs from or is not contained in the other. Some specific rating criteria are included in a tagging specification (Section 3), but by and large the degree of mismatch allowed before the pair was judged “non-equivalent” was left to the discretion of the individual rater: did a particular set of asymmetries alter the meanings of the sentences enough that they couldn’t be considered “the same” in meaning? This task was ill-defined enough that we were surprised at how high interrater agreement was (averaging 83%).
A series of experiments aimed at making the judging task more concrete resulted in uniformly degraded interrater agreement. Providing a checkbox to allow judges to specify that one sentence entailed another, for instance, left the raters frustrated and had a negative impact on agreement. Similarly, efforts to identify classes of syntactic alternations that would not count against an “equivalent” judgment resulted, in most cases, in a collapse in interrater agreement. The relatively few situations where we found firm guidelines of this type to be helpful (e.g. in dealing with anaphora) are included in Section 3.
The decision to tag sentences as being “more or less semantically equivalent”, rather than “semantically equivalent”, was ultimately a practical one: insisting on complete sets of bidirectional entailments would have ruled out all but the most trivial sorts of paraphrase relationships, such as sentence pairs differing only in a single word or in the presence of titles like “Mr.” and “Ms.”. Our interest was in identifying more complex paraphrase relationships, which required a somewhat looser definition of what “semantic equivalence” means. In an effort to focus on these more interesting pairs, the dataset was restricted to pairs with a minimum word-based Levenshtein distance of ≥ 8.
Given our relatively loose definition of equivalence, any 2 of the following sentences would probably have been considered “paraphrases”, despite obvious differences in information content:
· The genome of the fungal pathogen that causes Sudden Oak Death has been sequenced by US scientists
· Researchers announced Thursday they've completed the genetic blueprint of the blight-causing culprit responsible for sudden oak death
· Scientists have figured out the complete genetic code of a virulent pathogen that has killed tens of thousands of California native oaks
· The East Bay-based Joint Genome Institute said Thursday it has unraveled the genetic blueprint for the diseases that cause the sudden death of oak trees
Raters were presented with sentences in which several classes of named entities were replaced by generic tags, so that “Tuesday” became %%DAY%%, “$10,000” became %%MONEY%%, and so on. The release versions, however, preserve the original strings.
Note that many of the sentence pairs judged to be “not equivalent” will still overlap significantly in information content and even wording. A variety of automatic filtering techniques were used to create an initial dataset that was rich in paraphrase relationships, and the success of these techniques meant that approximately 70% of the pairs examined by raters were, by our criteria, semantically equivalent. The remaining 30% represent a range of relationships, from pairs that are completely unrelated semantically, to those that are partially overlapping, to those that are almost-but-not-quite semantically equivalent. For this reason, this “not equivalent” set should not be used as negative training data.
We have made every effort to ensure that each sentence in this dataset has been given proper attribution. If you encounter any errors/omissions, please contact Bill Dolan (), and we will promptly modify the data to reflect the correct information.
2. Methodology and Results
This data set consists of 5801 sentence pairs, with a binary human judgment of whether or not the pairing constitutes a paraphrase.
2.1. Methodology
To generate the judgments, we used 3 raters to score the sentence pairs according to a given specification. Rater 1 scored all 5801 sentences. Rater 2 scored 3533 sentences, and Rater 3 scored 2268 sentences. For the sentences where Rater 1 and 2 did not agree on the judgment, Rater 3 gave a final judgment, while Rater 2 gave the final judgment on sentences where Rater 1 and Rater 3 did not agree.
2.2. Interrater Agreement
To test interrater agreement, we took a simple percentage:
               Total scored    Total agreements    Percentage agreement
Raters 1 & 2       3533              2904                 82.20
Raters 1 & 3       2268              1921                 84.70
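The figures above are plain percentages; as a sketch (our own illustration, not the authors' code):

```python
def percentage_agreement(judgments_a: list, judgments_b: list) -> float:
    """Share of pairs on which two raters gave the same binary judgment."""
    assert len(judgments_a) == len(judgments_b)
    agreements = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * agreements / len(judgments_a)

# Raters 1 & 2: 2904 agreements out of 3533 pairs -> 82.20
```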
2.3. Overall scoring results
We computed scoring results for each individual (raw scores, before resolving differences):
           Total scored    Number "yes"    Percentage "yes"
Rater 1        5801            3601             62.08
Rater 2        3533            2589             73.28
Rater 3        2268            1612             71.08
After resolving differences, we judged 3900 out of 5801 sentence pairs to be valid paraphrases, for a final percentage of 67.23%.
2.4. Test/training
We assigned a random sequence ID to each sentence pair, sorted them, and assigned the first 30% of the data to be “training” and the last 70% to be “test” data. For obscure technical reasons, the final test/train percentage is inexact (29.7% (1725 sentence pairs) vs. 70.3% (4076 sentence pairs)).
3. Detailed Tagging Guidelines
3.1. “Equivalent” vs. “not equivalent” content
· In this task, we are trying to determine if two sentences express the same content.
· As is true for paraphrase in general, this may be realized by means of alternative but similar syntactic constructions and lexical items, etc.
· In general, the standard as to whether two sentences express the same content should be relatively high, meaning that many ambiguous cases should be marked "not equivalent" rather than "equivalent".
Examples of sentences with “equivalent” content expressed via alternative lexical items:
The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
American intelligence leading up to the war on Iraq will be criticised by a powerful US Congressional committee due to report soon, officials said today.
A strong geomagnetic storm was expected to hit Earth today with the potential to affect electrical grids and satellite communications.
A strong geomagnetic storm is expected to hit Earth sometime %%DAY%% and could knock out electrical grids and satellite communications.
These sentences are clearly paraphrases. The different lexical items are still expressing the same content. This type of sentence pair should be tagged as “equivalent”.
3.2. “Equivalent” sentence pairs with minor differences in content
Minor differences between sentences can be overlooked when determining if two sentences are paraphrases. For example:
An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%%.
An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by "strangulation/asphyxiation," Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.
The following sentences also express “equivalent” content:
Mr. Concannon had been doused in petrol, set himself alight and jumped onto a bike to leap eight metres onto a mattress below.
A SYDNEY man suffered serious burns after setting himself alight before attempting to jump a BMX bike off a toilet block into a pile of mattresses , police said.
The agent (Mr. Concannon), the predicated actions (set himself alight, jumped a bike), and important details (onto a pile of mattresses) are present in both sentences. Additional lexical material in either sentence mainly serves to embellish the main propositions (for example, “. . . suffered serious burns”, which is logically entailed by “set himself alight”). Also notice that the details of a given proposition need not be exact: a mattress (sing.) vs. a pile of mattresses (plur.). Finally, notice that the second sentence of the pair is “attributed” to the police where the first is not. This difference between sentences is also acceptable for purposes of tagging them as paraphrases.
For this type of sentence pair, we want to mark them as equivalent (paraphrases). Notice that the sentences, while clearly similar overall in content, both differ in additional, modifying content. As the main content of the sentences is similar in meaning, we “allow” some minor content mismatch.
3.3. Anaphora
Sometimes the difference between two sentences involves anaphora (full NPs versus pronominals). These sentences can be tagged as paraphrases despite the (sometimes) fairly large gap between them in terms of their corresponding full-form NPs. Examples follow.
3.3.1. Demonstratives
But Secretary of State Colin Powell brushed off this possibility %%day%%.
Secretary of State Colin Powell last week ruled out a non-aggression treaty.
3.3.2. NP -> pro
Meteorologists predicted the storm would become a category %%number%% hurricane before landfall.
It was predicted to become a category 1 hurricane overnight.
3.3.3. Proper NP (+animate) -> pro
Earlier, he told France Inter-Radio , ''I think we can now qualify what is happening as a genuine epidemic.''
''I think we can now qualify what is happening as a genuine epidemic,'' health minister Jean-Francois Mattei said on France Inter Radio.
3.3.4. Title + proper NP (+animate) -> pro
''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement, '' chief financial officer Jake Brace said in a statement.
''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement,'' he said.
3.3.5. NP (-animate) -> pro
''Spoofing is a problem faced by any company with a trusted domain name that uses e-mail to communicate with its customers.
It is a problem for Amazon and others that have a trusted domain name and use e-mail to communicate with customers.
3.4. Inherent ambiguity of the task
The relatively holistic/vague criteria established above should work well for most sentence pairs. In the end, we're tagging something that's not quite paraphrase but rather “semantic near-equivalence”: sentence pairs that ideally involve complete sets of bidirectional entailments, but which in fact often have some entailment asymmetries or other mismatches. The issue is when those asymmetries and differences become significant enough that you no longer think the two sentences mean more or less the same thing, where “more or less” is a personal judgment call.
3.5. Sentence pairs with “different” content
3.5.1. “Different” content: prototypical example
In contrast to the examples above, the following sentences clearly express “different” content:
Prime Minister Junichiro Koizumi did not have to dissolve parliament until next summer , when elections for the upper house are also due .
Prime Minister Junichiro Koizumi has urged Nakasone to give up his seat in accordance with the new age rule .
While the principal agent (Koizumi) is the same, the predicated actions, i.e. verbs (dissolve / urge), and other arguments (parliament / Nakasone) are clearly different. The additional material found in either sentence does not embellish the main proposition but instead contains important content itself. This sentence pair should be marked as “not equivalent”: while the sentences share the agent “Koizumi,” they are about unrelated events. Again, ambiguous cases should be marked "not equivalent" rather than "equivalent”.
3.5.2. Shared content of the same event, etc. but lacking details (one sentence is a superset of the other)
Researchers have identified a genetic pilot light for puberty in both mice and humans .
The discovery of a gene that appears to be a key regulator of puberty in humans and mice could lead to new infertility treatments and contraceptives.
These sentences are similar in content and refer to a similar key piece of information, but cannot be marked as “equivalent”. The pair should be tagged as “not equivalent” because, even though the content of the sentences is similar, one sentence is a significantly larger superset of the other: all the content of the first sentence is in the second, but not vice-versa. The superset sentence contains important content (the possible new infertility treatments and contraceptives) not present in the other.
Some similar sentence pairs follow (the superset sentence contains content missing from its counterpart):
SOME %%NUMBER%% jobs are set to go at Cadbury Schweppes , the confectionery and drinks giant , as part of a sweeping cost reduction programme announced today .
Confectionery group Cadbury Schweppes has warned of further cuts to its %%NUMBER%% -strong UK workforce .
This pair is difficult in that, while one sentence is a superset of the other, it is also arguably the case that the sentences are “almost” paraphrases, until we notice that some of the content is exclusive to one sentence. In the end, however, that material is an important difference in content between the sentences, and adds important additional content, leading us to prefer to tag them as “not equivalent”.
Please use your best judgment in choosing to tag sentences as “equivalent” or “not equivalent”. Many of the sentence pairs you see differ because editors eliminate language and content they deem unnecessary. Sometimes, though, one sentence will carry important additional information the other lacks. Pairs like the following should be tagged as “not equivalent”:
The former wife of rapper Eminem has been electronically tagged after missing two court appearances .
After missing two court appearances in a cocaine possession case, Eminem's ex-wife has been placed under electronic house arrest .
The issue of whether or not the extra/missing information is significant enough to warrant treating the sentences as “not equivalent” amounts to a judgment call. Minor differences between sentences can be overlooked when determining if two sentences are paraphrases. As seen in a previous example sentence pair, the only differences in content between the following sentences are the reduced forms of names and adverbial modifiers (dates). There are no major differences in content between these sentences. They can be marked as “equivalent”.
An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%% .
An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by " strangulation/asphyxiation , " Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.
The role of content asymmetries in determining whether sentences should be marked as equivalent/not equivalent is also linked to sentence length. In a pair of 20-word sentences, the presence/absence of a single modifier might be lost in the noise, while in a pair of 5 word sentences it might take on much greater significance. There is no good way to normalize for length in such cases, so again, just depend on your own judgment.
3.5.3. Cannot determine if sentences refer to the same event
More than %%NUMBER%% acres burned and more than %%NUMBER%% homes were destroyed in the massive Cedar Fire .
Major fires had burned %%NUMBER%% acres by early last night.
In this example, both sentences could be about the same series of events (fires). However, these are possibly about two events: one is about a specific fire, the other about a cluster of fires. This should lead us to annotate these sentences as expressing “not equivalent” content. Another such example follows:
The spokeswoman said four soldiers were wounded in the attack, which took place just before noon around %%NUMBER%% km ( %%NUMBER%% miles ) north of the capital Baghdad.
Two US soldiers were killed in a mortar attack near the Iraqi town of Samarra yesterday , a US military spokeswoman said.
Notice that both sentences report soldiers hurt or killed in an attack in some Iraqi town. However, it is clear that the two sentences could be describing two isolated events. The fact that there is a discrepancy in the reported casualties (four wounded vs. two killed) should add to one's suspicions that this might be the case. Since the sentences share some content, but we cannot be sure they refer to the same event, we should err on the side of caution and mark them as “not equivalent”.
3.5.4. Shared content but different rhetorical structure
The search feature works with around %%NUMBER%% titles from %%NUMBER%% publishers, which translates into some %%NUMBER%% million pages of searchable text .
This innovative search feature lets Amazon customers search the full text of a title to find a book , supplementing the existing search by author or title .
In this sentence pair, both sentences clearly make statements about a new search feature. However, notice the emphasis placed on the amount of data in the first sentence via the rhetorical device of reiterated citation of numbers. The two sentences are about the same subject matter, but they are significantly different in that the first might occur as a detailed exploration of the second. Therefore, this leads us to mark the sentences as “not equivalent”.
3.5.5. Same event, but different details and emphasis
A Hunter Valley woman sentenced to %%NUMBER%% years jail for killing her four babies was only a danger to children in her care, a court was told.
As she stood up yesterday to receive a sentence of %%NUMBER%% years for killing her four babies, Kathleen Folbigg showed no emotion.
These sentences clearly report information related to the same event, but the first emphasizes a particular legal argument presented by the convicted woman's lawyer, while the second focuses on her apparent mental state at the trial. Given the magnitude of the semantic divergence between the two sentences, both in content and in emphasis, this type of pair should be tagged as “not equivalent”.
More example sentence pairs which, while clearly significantly overlapping in content, should be tagged as “not equivalent”:
Authorities dubbed the investigation Operation Rollback , a reference to Wal-Mart's name for price reductions .
The ICE's investigation , known as " Operation Rollback " , targeted workers at %%NUMBER%% Wal-Mart stores in %%NUMBER%% states .
Researchers also found that women with mutations in the BRCA1 or BRCA2 gene have a %%NUMBER%% % to %%NUMBER%% % risk of ovarian cancer , depending on which gene is affected .
Earlier studies had suggested that the breast cancer risk from the gene mutations ranged from %%NUMBER%% % to %%NUMBER%% % .
Note that while the sentences may refer to the same piece of information, the inclusion of “earlier studies….” suggests this may not be the case. Therefore, they should be tagged as “not equivalent”.