■ Corpus category 1: English and Chinese entries run together with no uniform delimiter between them, as shown in the table below:
Skin card 贴体彩卡Blister card 吸塑彩卡Color paper card 彩色纸卡Heavy blister card 加厚吸塑插卡Double blister card 双层吸塑插卡 Display box 彩色展示盒burrerfly hole/euro hole 飞机孔sticker 不干胶pantone No. 潘东号inner case 内盒 front mark 正唛three layers of plastic packing 三层塑料打包
Click here: Add-Terms Entry 1
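A run-on sample like the one above can be split by script alone, since each entry is a stretch of non-Chinese characters followed by a stretch of Chinese characters. The following is a minimal sketch under that assumption, not the method used by the tool linked above; the function name `split_run_on_pairs` and the CJK range chosen are illustrative.

```python
import re

# Basic CJK Unified Ideographs block; an assumption that covers the sample terms.
CJK = r"\u4e00-\u9fff"

# One entry = a run of non-CJK characters followed by a run of CJK characters.
PAIR_RE = re.compile(rf"([^{CJK}]+)([{CJK}]+)")


def split_run_on_pairs(text):
    """Return (english, chinese) tuples from text that has no delimiters."""
    pairs = []
    for english, chinese in PAIR_RE.findall(text):
        english = english.strip()
        if english:  # skip matches whose non-CJK part is only whitespace
            pairs.append((english, chinese))
    return pairs


sample = "Skin card 贴体彩卡Blister card 吸塑彩卡sticker 不干胶pantone No. 潘东号"
for english, chinese in split_run_on_pairs(sample):
    print(english, "=>", chinese)
```

This heuristic breaks down when a Chinese term itself contains ASCII characters (a plus sign or a digit, for instance), so its output is worth checking by eye.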
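■ Corpus category 2: English and Chinese entries follow one another with no uniform delimiter, but the text is broken into orderly paragraphs, one pair per paragraph: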
Skin card 贴体彩卡
Blister card 吸塑彩卡
Color paper card 彩色纸卡
Heavy blister card 加厚吸塑插卡
Double blister card 双层吸塑插卡
Display box 彩色展示盒
Display box&Blister card 彩卡+展示盒
Neutral inner box 单瓦楞夹档白盒
Painted inner box 天地盖单色中盒
burrerfly hole/euro hole 飞机孔
sticker 不干胶
pantone No. 潘东号
inner case 内盒
front mark 正唛
side mark 侧唛
shipping mark 唛头
wooden case packing 木箱包装
seaworthy packing 海运包装
three layers of plastic packing 三层塑料打包
Click here: Add-Terms Entry 2
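When every pair sits on its own line, each line can be cut at its first Chinese character. This is a sketch assuming one pair per line and English before Chinese; `split_line_pairs` is a hypothetical helper, not part of the tool linked above.

```python
import re

# First character in the basic CJK Unified Ideographs block (an assumption).
FIRST_CJK = re.compile(r"[\u4e00-\u9fff]")


def split_line_pairs(text):
    """Split each line into (english, chinese) at the first Chinese character."""
    pairs = []
    for line in text.splitlines():
        match = FIRST_CJK.search(line)
        if match is None:
            continue  # no Chinese on this line, so nothing to pair
        english = line[:match.start()].strip()
        chinese = line[match.start():].strip()
        if english and chinese:
            pairs.append((english, chinese))
    return pairs


sample = "Display box 彩色展示盒\nDisplay box&Blister card 彩卡+展示盒"
print(split_line_pairs(sample))
```

Because the cut happens only once per line, a Chinese term that contains ASCII characters, such as 彩卡+展示盒, stays intact.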
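■ Corpus category 3: English and Chinese entries are separated by a uniform delimiter and broken into orderly paragraphs, as shown in the table below: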
Skin card ||贴体彩卡
Blister card ||吸塑彩卡
Color paper card ||彩色纸卡
Heavy blister card ||加厚吸塑插卡
Double blister card ||双层吸塑插卡
Display box ||彩色展示盒
burrerfly hole/euro hole ||飞机孔
sticker ||不干胶
pantone No. ||潘东号
inner case ||内盒
front mark ||正唛
three layers of plastic packing ||三层塑料打包
Click here: Add-Terms Entry 3
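With a uniform delimiter and one pair per line, parsing is a plain split. The sketch below assumes the `||` delimiter shown in the sample; the function name `parse_delimited_lines` is illustrative only.

```python
def parse_delimited_lines(text, delimiter="||"):
    """Parse 'english ||chinese' lines into (english, chinese) tuples."""
    pairs = []
    for line in text.splitlines():
        if delimiter not in line:
            continue  # blank or malformed line
        # Split only on the first delimiter so the target text stays whole.
        source, target = line.split(delimiter, 1)
        pairs.append((source.strip(), target.strip()))
    return pairs


sample = "Skin card ||贴体彩卡\nBlister card ||吸塑彩卡"
print(parse_delimited_lines(sample))
```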
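■ Corpus category 4: English and Chinese entries are separated by a uniform delimiter but run together with no orderly paragraph breaks, as shown in the table below: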
Skin card ||贴体彩卡||Blister card ||吸塑彩卡||Color paper card ||彩色纸卡||Heavy blister card ||加厚吸塑插卡||Double blister card ||双层吸塑插卡|| Display box ||彩色展示盒||burrerfly hole/euro hole ||飞机孔||sticker ||不干胶||pantone No. ||潘东号||inner case ||内盒||front mark ||正唛||three layers of plastic packing ||三层塑料打包
Click here: Add-Terms Entry 4
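When the same delimiter separates both the two halves of a pair and one pair from the next, the text can be split into a flat list of fields and regrouped two at a time. This is a sketch that assumes the fields strictly alternate English, Chinese; `parse_delimited_stream` is a hypothetical name.

```python
def parse_delimited_stream(text, delimiter="||"):
    """Split on the delimiter, then pair consecutive fields."""
    fields = [field.strip() for field in text.split(delimiter)]
    fields = [field for field in fields if field]  # drop empty fields
    if len(fields) % 2:
        raise ValueError("odd number of fields; a term is missing its translation")
    # Even positions hold English terms, odd positions their Chinese equivalents.
    return list(zip(fields[0::2], fields[1::2]))


sample = "Skin card ||贴体彩卡||Blister card ||吸塑彩卡|| Display box ||彩色展示盒"
print(parse_delimited_stream(sample))
```

Raising on an odd field count is a deliberate choice here: one missing field would otherwise shift every later pair out of alignment without any warning.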
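■ Corpus category 5: Each English entry and each Chinese entry forms its own paragraph, and the two alternate in order, as shown in the table below: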
Skin card
贴体彩卡
Blister card
吸塑彩卡
Color paper card
彩色纸卡
Heavy blister card
加厚吸塑插卡
Double blister card
双层吸塑插卡
front mark
正唛
three layers of plastic packing
三层塑料打包
Click here: Add-Terms Entry 5
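For alternating paragraphs, the odd-numbered non-empty lines can be zipped with the even-numbered ones. The sketch assumes the English line always comes first; `pair_alternating_lines` is an illustrative name.

```python
def pair_alternating_lines(text):
    """Pair line 1 with line 2, line 3 with line 4, and so on."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError("odd number of lines; an entry is missing its translation")
    return list(zip(lines[0::2], lines[1::2]))


sample = "Skin card\n贴体彩卡\nBlister card\n吸塑彩卡\n"
print(pair_alternating_lines(sample))
```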
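■ Corpus category 6: Pure Chinese text excerpted from the web, in which lines end with incorrect paragraph marks or line breaks while each genuine paragraph begins with a first-line indent, as shown in the table below: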
S "这孩子到底怎么啦,我真搞不懂?你这个汤姆!" 还是没有人答应。 这老太太拉低眼镜从镜片上方朝房间看了看,然 后她又抬高眼镜从镜片下面看。她很少或者干脆说她 从来没戴正眼镜来找像一个小男孩这样小的东西。这 副眼镜是很考究的,也是她的骄傲,她配这副眼镜不 是为了实用,而是为了"装饰",为了"漂亮"。她看东 西时,即使戴上两片炉子盖也照样看得一清二楚。她 茫然不知所措地愣了一会儿。然后虽然不是凶神恶煞 般,但嗓门高得让每个角落都能听到,她说: "好,我发誓如果我抓住你,我就--" 她话没有说完,因为这时她正弯腰用扫把往床下 猛捣,每捣一下,她需要停下来换口气。结果,只捣 出来一只猫。 "我还从没有见过这么令人吃惊的孩子!" 她走到敞开的门口,站在那里朝满园子的西红柿 藤和吉普逊草丛中看,想找到汤姆,可还是没有。于 是她亮开嗓子朝远处,高声喊到: "汤姆呀,汤姆!"
Click here: use "Text Cleanup Tool 1" to remove the unwanted paragraph marks and keep the correct ones
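The rule described for this category (a line that starts with an indent opens a paragraph, any other line continues the previous one) can be sketched as follows. The function name `merge_wrapped_lines` is hypothetical, and treating any leading whitespace, including the full-width space U+3000, as the indent is an assumption.

```python
def merge_wrapped_lines(text):
    """Join hard-wrapped lines; a leading indent marks a real paragraph start."""
    paragraphs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        starts_paragraph = line[0].isspace()  # true for ASCII and U+3000 spaces
        content = line.strip()
        if starts_paragraph or not paragraphs:
            paragraphs.append(content)
        else:
            # Chinese text needs no space where two wrapped lines are joined.
            paragraphs[-1] += content
    return paragraphs


sample = "\u3000\u3000结果,只捣\n出来一只猫。\n\u3000\u3000还是没有人答应。"
print(merge_wrapped_lines(sample))
```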
■ Corpus category 7: Text excerpted from the web that contains HTML tags, as shown in the table below:
<P>"TOM!"</P> <P>No answer.</P> <P>"TOM!"</P> <P>No answer.</P> <P>"What's gone with that boy, I wonder? You TOM!"</P> <P>No answer.</P> <P>The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them.She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for "style," not service -- she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment,and then said, not fiercely, but still loud enough for the furniture to hear:</P> <P>"Well, I lay if I get hold of you I'll --"</P> <P>She did not finish, for by this time she was bending down and punching under the bed with the broom, and so she needed breath to punctuate the punches with. She resurrected nothing but the cat.</P>
Click here: use "Text Cleanup Tool 2" to strip the unwanted HTML tags and keep the plain text
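Stripping tags while keeping one paragraph per `<P>` element can be done with the standard-library `html.parser` module. This is a minimal sketch assuming paragraph-level markup only; the class `_TextExtractor` and the function `strip_html` are illustrative names and do not describe how the linked tool works.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, starting a new paragraph at each <p> boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Collapse runs of whitespace and store the paragraph if non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":  # the parser reports tag names in lower case
            self.flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def strip_html(markup):
    """Return the plain-text paragraphs found in an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    parser.flush()  # keep any text that follows the last tag
    return parser.paragraphs


print(strip_html('<P>"TOM!"</P> <P>No answer.</P>'))
```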
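■ Corpus category 8: The text contains a specific paragraph delimiter, and the line feeds and carriage returns already in the text must be deleted, as shown in the table below (</p> is the paragraph delimiter):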
Click here: use "Text Cleanup Tool 3" to split the text into paragraphs
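The two steps described for this category (delete the existing line breaks, then cut at the delimiter) can be sketched as below. The function name `split_on_marker` and its `joiner` parameter are assumptions added for illustration; `joiner` defaults to an empty string, which suits Chinese text, and can be set to a space for English.

```python
def split_on_marker(text, marker="</p>", joiner=""):
    """Remove existing line breaks, then split the text at each marker."""
    # Normalise Windows and old Mac line endings before removing them.
    flattened = text.replace("\r\n", "\n").replace("\r", "\n")
    flattened = joiner.join(flattened.split("\n"))
    return [part.strip() for part in flattened.split(marker) if part.strip()]


sample = "第一段的前半,\r\n后半。</p>第二段。</p>\n"
print(split_on_marker(sample))
```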