New Paper: Domain-Specific Corpus and Pre-trained Models for NLP in Architecture

DOI: https://doi.org/10.1016/j.compind.2022.103733

50 Days Free Access Link: https://authors.elsevier.com/a/1fHOibquFR5MK

TL;DR

AI applications in architecture and construction are attracting significant attention. The construction industry produces large volumes of text (e.g., engineering specifications, contracts, and construction documents) that are rich in domain concepts and semantic features and encode complex domain knowledge. For example, any engineer readily understands the sentence “The fire resistance of firewalls in Class A and B factories and Class A, B, and C warehouses should not be less than 4.00h.” But can a computer understand it automatically? Natural language processing (NLP) is expected to deliver breakthroughs here: automating text processing and knowledge sharing, reducing manual input, and supporting the transformation and upgrading of the industry.

However, existing deep learning-based NLP models in this field still depend on large amounts of manual data annotation (as the saying goes, the more intelligence you want, the more manual work it takes). This paper therefore explores how to improve model performance using the prior knowledge in domain text, without additional manual annotation. The work comprises: (1) building and publicly releasing a domain-specific corpus for architecture; (2) systematically examining how different types of corpora and pre-training methods improve model performance; and (3) proposing the domain pre-trained model ARCBERT, which learns domain knowledge automatically from unlabeled corpora and significantly improves performance on a range of NLP tasks in the domain.

Abstract

Information processing and retrieval for unstructured text based on natural language processing (NLP) is an important task in the construction industry and is receiving increasing attention. With advances in deep learning (DL) and the availability of open datasets for model pre-training, many NLP methods have been further developed and improved. However, there is still little research on domain-specific pre-trained language models in the architecture, engineering, and construction (AEC) field or on their advantages. The main reasons are (1) the lack of a unified, publicly available dataset for model evaluation and (2) the scarcity of publicly available domain corpora for further research.

In the AEC field, deep learning-based methods still require very expensive manual annotation to prepare large training datasets. It is therefore necessary to explore how unlabeled domain corpora affect the performance of deep learning-based NLP tasks in the AEC field, and how they can be used to further improve model performance.

To address these issues, the study follows the technical roadmap shown in Figure 1. It first developed a domain-specific corpus for architecture (open-sourced at https://github.com/SkydustZ/AEC-domain-corpora/tree/main/domain%20corpus). Next, a range of pre-trained language models was trained using two types of domain corpora (in-domain and close-domain) and two pre-training methods (static word embedding and contextual word embedding). The paper then systematically studied how pre-training corpora and methods affect the performance of deep learning models, using two downstream tasks in automated rule checking (ARC) as typical cases: text classification (TC) and named entity recognition (NER). Notably, the ARCBERT model, obtained by further pre-training on the domain corpus developed in this paper, achieved the best results of all models on these downstream tasks. This means that learning automatically from the prior knowledge in a domain corpus can significantly improve NLP task performance without additional manual annotation.

Figure 1 Technical roadmap for enhancing deep learning models with domain corpora and pre-training methods

Development of Domain Corpus

First, a large amount of domain text was collected and organized into two corpora. The in-domain corpus consists of civil engineering regulatory texts, which are also the target texts of ARC tasks, hence the name. The close-domain corpus consists of other civil engineering materials, including civil engineering encyclopedia entries, which are close to the ARC texts, hence the name.

The corpus was built by web scraping relevant texts and then cleaning the data (filtering irrelevant content and splitting long texts). The resulting domain corpus is summarized in Table 1.
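The post does not give details of the cleaning step, so the following is only a minimal sketch of what "filtering irrelevant content, splitting long texts" could look like. The function name `clean_and_split`, the character-based relevance filter, and the `max_chars` limit are all hypothetical choices made for illustration, not the authors' pipeline.

```python
import re

MAX_CHARS = 128  # hypothetical length limit; the post does not state one


def clean_and_split(raw_texts, max_chars=MAX_CHARS):
    """Drop lines without real text and split long texts at sentence ends."""
    segments = []
    for text in raw_texts:
        # Collapse whitespace left over from scraping.
        text = re.sub(r"\s+", " ", text).strip()
        # Illustrative filter: keep only lines containing CJK characters or letters.
        if not re.search(r"[\u4e00-\u9fffA-Za-z]", text):
            continue
        # Split right after Chinese sentence-ending punctuation.
        sentences = [s for s in re.split(r"(?<=[。！？；])", text) if s]
        chunk = ""
        for sentence in sentences:
            # Start a new segment when adding the sentence would exceed the limit.
            if chunk and len(chunk) + len(sentence) > max_chars:
                segments.append(chunk)
                chunk = ""
            chunk += sentence
        if chunk:
            segments.append(chunk)
    return segments
```

A single sentence longer than `max_chars` is kept whole here; a real pipeline would need a policy for that case.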

Table 1 Statistics of the Domain Corpus

Pre-trained Models

Pre-trained word embedding models fall into two groups: static word embedding models and dynamic word embedding models (also known as contextual word embedding models). Static models assume that a word's meaning does not change with context; dynamic models assume that it does.

Figure 2 shows the technical roadmap for training the static word embedding models. Four word embedding models were trained with the skip-gram model on the Chinese Wikipedia corpus (referred to as the Wiki corpus) and the domain corpus: (1) a general model, trained on the Wiki corpus only; (2) an in-domain model, trained on the Wiki and in-domain corpora; (3) a close-domain model, trained on the Wiki and close-domain corpora; and (4) a mixed-domain model, trained on the Wiki, in-domain, and close-domain corpora.
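As a rough illustration of the setup above, the sketch below lists which corpora feed each of the four static models and shows how skip-gram forms its (center word, context word) training pairs. The window size, the function name `skipgram_pairs`, and the example tokens are assumptions for illustration; the post does not state the hyperparameters or the word segmentation that were actually used.

```python
# Corpora combined for each of the four static word embedding models,
# as listed in the text. The keys and corpus labels are illustrative names.
TRAINING_CORPORA = {
    "general": ["wiki"],
    "in-domain": ["wiki", "in-domain"],
    "close-domain": ["wiki", "close-domain"],
    "mixed-domain": ["wiki", "in-domain", "close-domain"],
}


def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs that skip-gram learns to predict.

    Each word is paired with every other word at most `window` positions away.
    The window size is an assumed value, not taken from the paper.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical pre-segmented sentence (firewall / fire resistance / limit).
example_pairs = skipgram_pairs(["防火墙", "耐火", "极限"], window=1)
```

In practice a library implementation of skip-gram would be trained on the concatenated corpora; the pair generation here only shows what the model is asked to predict.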

Figure 2 Method for enhancing deep learning models with architecture-specific static word embedding models

Figure 3 shows the technical roadmap for training the dynamic word embedding models. First, two benchmark models were selected: (1) bert-base-chinese[1] and (2) ERNIE[2]. The domain corpus was then combined and subdivided to explore the influence of corpus type and size, giving five derived corpora: (1) the in-domain corpus; (2) the close-domain corpus; (3) the mixed-domain corpus (in-domain plus close-domain); (4) 1/3 of the in-domain corpus; and (5) 1/5 of the in-domain corpus.
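The five derived corpora could be assembled along the following lines. This is a sketch under assumptions: the post does not say how the 1/3 and 1/5 subsets were drawn, so random sampling with a fixed seed is used here, and `derive_corpora` is a hypothetical helper name.

```python
import random
from itertools import product

BASE_MODELS = ["bert-base-chinese", "ERNIE"]  # the two benchmark models named in the text


def derive_corpora(in_domain, close_domain, seed=0):
    """Build the five derived corpora from lists of text segments.

    Assumption: the fractional subsets are random samples of the in-domain corpus.
    """
    rng = random.Random(seed)
    shuffled = list(in_domain)
    rng.shuffle(shuffled)
    return {
        "in-domain": list(in_domain),
        "close-domain": list(close_domain),
        "mixed-domain": list(in_domain) + list(close_domain),
        "in-domain-1/3": shuffled[: len(shuffled) // 3],
        "in-domain-1/5": shuffled[: len(shuffled) // 5],
    }


# Toy placeholder data, only to show the shapes involved.
corpora = derive_corpora([f"rule {i}" for i in range(30)],
                         [f"entry {i}" for i in range(10)])
# Pairing every benchmark model with every derived corpus gives 2 x 5 combinations.
combinations = list(product(BASE_MODELS, corpora))
```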

Each benchmark model was then further pre-trained with the masked language model objective on each of these five domain corpora, yielding 10 pre-trained models (2 benchmark models × 5 corpora), as shown in Table 2.
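For readers unfamiliar with the masked language model objective, here is a minimal sketch of how training inputs are corrupted. It follows the standard BERT recipe (select 15% of tokens; replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged). Whether the study used exactly these rates is not stated in the post, so treat them as assumptions; `mask_tokens` is a hypothetical helper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(token)              # 10%: unchanged
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels
```

Because the labels come from the text itself, this objective needs no manual annotation, which is what lets the domain corpus be used as is.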

Figure 3 Method for enhancing deep learning models with architecture-specific dynamic word embedding models

Table 2 Summary of the further pre-trained models

Evaluation of Performance Improvement of Deep Learning Models

A series of experiments was conducted to evaluate how much the pre-trained models described above improve the performance of deep learning models. The experimental configuration was as follows.

(1) Dataset selection: The methods and domain corpus in this study were evaluated on two datasets: the regulatory TC dataset[3] and the NER dataset[4].

(2) Metric selection: The weighted F₁ score was used for the TC model, and the macro-average F₁ score (macro F₁) was used for the NER model. First, the precision (P), recall (R), and F₁ score (F₁) were calculated for each semantic label:

where, for each label, N_correct is the number of elements predicted correctly, N_labeled is the total number of elements the model labeled, and N_true is the number of elements that actually carry the label.
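The formulas themselves are not reproduced in this text. Under the definitions above, they would presumably take the standard form below; this is a reconstruction from the variable definitions, not a copy of the paper's equations.

```latex
P = \frac{N_{\text{correct}}}{N_{\text{labeled}}}, \qquad
R = \frac{N_{\text{correct}}}{N_{\text{true}}}, \qquad
F_1 = \frac{2PR}{P + R}
```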

The weighted F₁ and macro-average F₁ were then calculated:

where n_i is the number of elements for the i-th semantic label and m is the number of semantic label types.
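The averaging formulas are likewise missing from this text. A minimal sketch using the standard definitions: weighted F₁ is the mean of the per-label F₁ scores weighted by n_i, i.e. Σ n_i·F₁_i / Σ n_i, and macro F₁ is the unweighted mean over the m labels, (1/m) Σ F₁_i. The function names and the example counts are made up for illustration.

```python
def per_label_f1(n_correct, n_labeled, n_true):
    """F1 for one label from its counts (0.0 when a denominator is zero)."""
    p = n_correct / n_labeled if n_labeled else 0.0
    r = n_correct / n_true if n_true else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0


def weighted_f1(f1_scores, supports):
    """Mean of per-label F1 scores weighted by n_i, the elements per label."""
    return sum(n * f for n, f in zip(supports, f1_scores)) / sum(supports)


def macro_f1(f1_scores):
    """Unweighted mean of per-label F1 scores over the m labels."""
    return sum(f1_scores) / len(f1_scores)


# Made-up counts for two labels: (N_correct, N_labeled, N_true).
counts = [(8, 10, 10), (1, 2, 4)]
scores = [per_label_f1(*c) for c in counts]
supports = [c[2] for c in counts]  # n_i taken as the true count per label
```

Weighted F₁ lets frequent labels dominate the score, while macro F₁ treats every label equally, so rare entity types count as much as common ones.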

(3) Dataset division: Both datasets were randomly split into training, validation, and test sets in the ratio 0.8 : 0.1 : 0.1.
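A random 0.8 : 0.1 : 0.1 split can be sketched as follows. The seed and the helper name `split_dataset` are assumptions; the post does not say how the shuffling was seeded or whether the split was stratified by label.

```python
import random


def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the samples and cut them into train, validation, and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
```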

(4) Model training and fine-tuning: Four types of experiments were conducted, including: 1) static word embedding model for TC; 2) static word embedding model for NER; 3) further pre-trained dynamic word embedding model for TC; 4) further pre-trained dynamic word embedding model for NER.

(5) Experimental conclusions: The experimental results are summarized in Figures 4 and 5. They indicate that, for the TC and NER tasks in ARC, the in-domain corpus can be used to train domain-specific word embedding models or to further pre-train the BERT and ERNIE models, improving model performance without additional manual annotation.

Figure 4 Performance of various models on TC tasks

Figure 5 Performance of various models on NER tasks

Conclusion

This study systematically examined how domain corpora affect the performance of deep learning methods on NLP tasks in the architecture field. A domain corpus was first developed and publicly released; four experiments then demonstrated the advantages of this corpus and of deep learning models based on dynamic word embedding models (such as BERT):

(1) For TC and NER tasks, the domain corpus improves deep learning models based on both static and dynamic word embedding models. For TC, the weighted F₁ scores of the two model types improved by 11.4% and 6.4%, respectively; for NER, the macro-average F₁ scores improved by 8.7% and 5.4%, respectively.

(2) The BERT model further pre-trained on the domain corpus (ARCBERT) outperformed the deep learning models based on static word embedding: its weighted F₁ score for TC was 8.1% higher and its macro-average F₁ score for NER was 3.8% higher. Deep learning models based on dynamic word embedding (such as BERT and ERNIE) performed better on the NER and TC tasks than the other models.

Finally, this study proposed two pre-trained models: ARCBERT_Large, which reached a weighted F₁ score of 94.4% on the TC task, and ARCBERT_Small, which reached a macro-average F₁ score of 81.8% on the NER task. These were the best results among all models compared on the TC and NER tasks. The domain corpus and pre-trained models developed in this study performed well across NLP tasks and may inspire future NLP research and applications in the architecture field.

[1] Hugging Face, 2019. Bert-base-chinese. https://huggingface.co/bert-base-chinese/tree/main

[2] Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., Wu, H., 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

[3] Zheng, Z., Zhou, Y.C., Chen, K.Y., Lu, X.Z., Lin, J.R., She, Z.T., 2022. Text classification-based approach for automatically evaluating building codes’ interpretability. (in preparation).

[4] Zhou, Y.C., Zheng, Z., Lin, J.R., Lu, X.Z., 2020. Deep natural language processing-based rule transformation for automated regulatory compliance checking. Preprint. https://doi.org/10.13140/RG.2.2.22993.45921

