Building knowledge graphs from technical documents using named entity recognition and edge weight updating neural network with triplet loss for entity normalization

Jeon, Sung Hwan; Lee, Hye Jin; Park, Jihye; Cho, Sungzoon

doi:10.3233/IDA-227129

Article type: Research Article

Authors: Jeon, Sung Hwan^a | Lee, Hye Jin^a | Park, Jihye^a | Cho, Sungzoon^{a; b; *}

Affiliations: [a] Department of Industrial Engineering, Seoul National University, Gwanak-ro, Gwanak-gu, Seoul, Korea | [b] Institute for Industrial Systems Innovation, Seoul National University, Gwanak-ro, Gwanak-gu, Seoul, Korea

Correspondence: [*] Corresponding author: Sungzoon Cho, Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Gwanak-ro, Gwanak-gu, Seoul, Korea. E-mail: [email protected].

Abstract: Attempts to express information from various documents in graph form are rapidly increasing. The speed and volume at which these documents are generated call for an automated process, based on machine learning techniques, for cost-effective and timely analysis. Past studies responded to this need by building knowledge graphs or technology trees from the bibliographic information of documents, or by relying on text mining techniques to extract keywords and/or phrases. While these approaches provide an intuitive view of the technological hotspots or the key features of the selected field, there is still room for improvement, especially in recognizing the same entities appearing in different forms so that closely related technological concepts are properly interconnected. In this paper, we propose to build a patent knowledge network using the United States Patent and Trademark Office (USPTO) patent filings for the semiconductor device sector by fine-tuning Hugging Face’s named entity recognition (NER) model with our novel edge weight updating neural network. For named entity normalization (NEN), we employ the edge weight updating neural network with positive and negative candidates chosen by substring matching techniques. Experimental results show that our proposed approach performs very competitively against the conventional keyword extraction models frequently employed in patent analysis, especially for the NEN and document retrieval tasks. By grouping entities with the NEN model, the resulting knowledge graph achieves higher scores in retrieval tasks. We also show that our model is robust to the out-of-vocabulary problem by employing the fine-tuned BERT NER model.
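As a rough illustration of the normalization idea in the abstract, the sketch below splits candidate mentions into positives and negatives by substring matching and evaluates a standard triplet loss on toy embedding vectors. It is not the authors' implementation: the containment rule, the Euclidean distance, the margin value, and the helper names `is_substring_match`, `split_candidates`, and `triplet_loss` are all assumptions made for this example, and the edge weight updating network itself is not modeled.

```python
import math


def is_substring_match(a, b):
    """Assumed rule: two mentions match if one contains the other (case-insensitive)."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def split_candidates(anchor, mentions):
    """Split mentions into positive and negative candidates for an anchor mention."""
    positives, negatives = [], []
    for mention in mentions:
        if mention == anchor:
            continue  # the anchor is not its own candidate
        if is_substring_match(anchor, mention):
            positives.append(mention)
        else:
            negatives.append(mention)
    return positives, negatives


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor_vec, positive_vec, negative_vec, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean(anchor_vec, positive_vec)
    d_neg = euclidean(anchor_vec, negative_vec)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical mentions, chosen only to illustrate the split.
mentions = ["gate", "gate electrode layer", "source region", "gate electrode"]
positives, negatives = split_candidates("gate electrode", mentions)
```

In this sketch the loss is zero once the positive candidate sits closer to the anchor than the negative by at least the margin, which is the property that pulls variant forms of the same entity together in embedding space.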

Keywords: Knowledge graph, named entity normalization, information extraction, keyword extraction

Journal: Intelligent Data Analysis, vol. 28, no. 1, pp. 331-355, 2024

Published: 3 February 2024
