Mingyang Song1, Yan Huang2, Guojian Xu1, Zhenghong Jia1*, Li Tang2 and Zhenggang Leng1
1College of Information Science and Engineering, Xinjiang University, Urumqi, China
2Network Department, China Mobile Communications Group Xinjiang Co., Ltd., Urumqi, China
*Corresponding author: Zhenghong Jia, College of Information Science and Engineering, Xinjiang University, Urumqi, China
Submission: February 24, 2023; Published: April 21, 2023
ISSN: 2832-4463 Volume 3 Issue 1
The article proposes a system solution for analyzing logs with big data. It adopts the Hadoop-ecosystem big data processing framework and the Spark compute engine. For data acquisition, it uses Scrapy, the current mainstream web crawling tool, with crawlers supplementing the data we need [1-5]: the company registration and filing data are obtained, compared against the log information in the big data cluster, and these companies' domain-name aliases and IP information are filtered out of the logs. A data warehouse model is built that divides the data at fine granularity from acquisition to analysis and filters it layer by layer; data retrieval and transaction management are optimized, and a standardized dimensional data model is used to tune database performance, so that the database can be searched quickly, the organization of the data warehouse is easier for users to understand and use, and the functional granularities required for daily and weekly analysis are determined [6-9].
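As a minimal sketch of the deduplication and layer-by-layer filtering described above, the DNS logs can be matched in Spark against the registered company domain list. The paths, column names, and the `top55_domains.txt` file below are illustrative assumptions, not the system's actual schema.

```python
# Minimal PySpark sketch: deduplicate raw DNS logs and keep only rows whose
# queried domain falls under one of the registered TOP55 company domains.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dns-log-filter").getOrCreate()

# Raw DNS logs, e.g. columns: timestamp, client_ip, query_domain, answer_ip
logs = (spark.read.csv("hdfs:///dns/raw/*.csv", header=True)
        .dropDuplicates(["query_domain", "answer_ip"]))   # field-level dedup

# Registered company domains collected by the Scrapy crawler (assumed file)
domains = [d.strip() for d in open("top55_domains.txt") if d.strip()]

# Keep log rows whose query_domain ends with any registered domain
cond = None
for d in domains:
    c = F.col("query_domain").endswith(d)
    cond = c if cond is None else (cond | c)

matched = logs.filter(cond)
matched.write.mode("overwrite").parquet("hdfs:///dns/matched/")
```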
A resale analysis platform is built: resale statistics are displayed through a UI based on the Spring Boot framework, with LayUI and Bootstrap used to design the front-end web pages, Spring Security for security verification, ECharts for data reports, and Ajax for front-end interaction. The backend uses a MySQL database and Python scripts for data analysis [10-13].
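For the Python analysis backend, one hedged possibility is a script that reads resale records from MySQL and aggregates them per company for the report pages. The `resale_records` table, its columns, and the connection string are assumptions, not the system's actual schema.

```python
# Hypothetical backend analysis script: aggregate resale counts per company
# from a MySQL table so the Spring Boot UI can render them with ECharts.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/resale_db")

# Assumed table: resale_records(company, resale_time)
df = pd.read_sql("SELECT company, resale_time FROM resale_records", engine)

# Resale count per company and the busiest hour of day for each
df["hour"] = pd.to_datetime(df["resale_time"]).dt.hour
summary = (df.groupby("company")
             .agg(resale_count=("resale_time", "size"),
                  peak_hour=("hour", lambda h: h.mode().iat[0]))
             .reset_index())

summary.to_json("resale_summary.json", orient="records")  # consumed by the UI
```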
The contributions are as follows:
A. Obtain the subdomain names registered by the TOP55 companies through Scrapy crawlers to establish the TOP55 customer domain-name information database.
B. Propose an improved generalized suffix automaton algorithm: build a big data platform to deduplicate and clean the DNS log fields, compile the subdomain database into a generalized suffix automaton, feed in the domain-name field of each line of the DNS log, and retrieve the matching domain names and IPs from the log (a simplified matching sketch appears after this list).
C. Propose adding a caching middleware to the Scrapy framework. The Scrapy crawler obtains the company to which each CNAME domain name belongs; before crawling a domain name's attribution, it asks the cache middleware whether that name has already been fetched, which avoids repeatedly crawling the attribution of the same name and greatly reduces the time spent on crawling (see the middleware sketch after this list).
D. Use Python-based pandas matching and consecutive regular-expression matching to map each IP in the logs to its corresponding company (a hedged sketch follows this list).
E. Build the TOP55 customer resale-behavior analysis page platform on Spring Boot, analyze the resale counts and resale times of specific companies, and draw the resale distribution map.
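The full generalized suffix automaton of contribution B is beyond a short example, so the sketch below is a much-simplified stand-in: it builds a trie over reversed domain labels from the TOP55 subdomain database and checks whether each DNS-log domain falls under a registered (sub)domain. All names and example data are illustrative.

```python
# Simplified stand-in for contribution B: a reversed-label domain trie instead
# of the paper's improved generalized suffix automaton.
def build_trie(domains):
    root = {}
    for d in domains:
        node = root
        for label in reversed(d.lower().strip(".").split(".")):
            node = node.setdefault(label, {})
        node["$"] = d                      # mark the end of a registered domain
    return root

def match(trie, domain):
    """Return the registered domain that `domain` falls under, or None."""
    node, hit = trie, None
    for label in reversed(domain.lower().strip(".").split(".")):
        if "$" in node:
            hit = node["$"]
        if label not in node:
            break
        node = node[label]
    else:
        if "$" in node:
            hit = node["$"]
    return hit

trie = build_trie(["mail.example.com", "example.org"])
print(match(trie, "smtp.mail.example.com"))   # -> mail.example.com
print(match(trie, "cdn.other.net"))           # -> None
```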
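For contribution C, a hedged sketch of a Scrapy downloader middleware that caches attribution lookups might look as follows. The class name, the in-memory dict cache, and the `domain` request meta key are assumptions, not the paper's implementation.

```python
# Hypothetical caching middleware: short-circuit requests for CNAME domains
# whose attribution page has already been fetched.
from scrapy.http import HtmlResponse

class AttributionCacheMiddleware:
    def __init__(self):
        self.cache = {}                   # domain -> previously fetched body

    def process_request(self, request, spider):
        domain = request.meta.get("domain")
        if domain in self.cache:
            # Answer from the cache instead of downloading again
            return HtmlResponse(url=request.url, body=self.cache[domain],
                                encoding="utf-8", request=request)
        return None                       # not cached yet; download as usual

    def process_response(self, request, response, spider):
        domain = request.meta.get("domain")
        if domain is not None and domain not in self.cache:
            self.cache[domain] = response.body
        return response
```

The middleware would be enabled through `DOWNLOADER_MIDDLEWARES` in `settings.py`; for a distributed crawl, a shared store such as Redis could replace the in-memory dict.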
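Reading contribution D as matching the IPs extracted from log answer fields back to their owning companies, a hedged pandas-plus-regex sketch is shown below; the column names and example rows are invented for illustration.

```python
# Hedged sketch for contribution D: extract IPv4 addresses from raw answer
# fields with a regular expression, then join them to companies via the domain.
import re
import pandas as pd

ipv4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

logs = pd.DataFrame({
    "domain": ["mail.example.com", "www.example.org"],
    "answer": ["CNAME cdn.example.com; A 203.0.113.7", "A 198.51.100.24"],
})
attribution = pd.DataFrame({
    "domain": ["mail.example.com", "www.example.org"],
    "company": ["Example Co.", "Example Org"],
})

# Extract every IP in the answer field, one row per (domain, ip) pair
logs["ip"] = logs["answer"].apply(ipv4.findall)
pairs = logs.explode("ip")[["domain", "ip"]]

# Attach the owning company of each IP through its domain
result = pairs.merge(attribution, on="domain", how="left")
print(result)
```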
© 2023 Zhenghong Jia. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon your work non-commercially.