
Static code checking methods. Analyzing legacy code when the source code is lost: to do or not to do? Combining different types of analysis in the development workflow

Introduction

Standard features of off-the-shelf software products and various management systems are insufficient for most customers. Website content management systems (for example, WordPress, Joomla or Bitrix), accounting programs, customer relationship management (CRM) systems, and enterprise and production systems (for example, 1C and SAP) provide ample opportunities to extend functionality and adapt to the needs of specific customers. Such extensions are implemented with third-party modules, either custom-made or customized from existing ones. These modules are program code, written in one of the built-in programming languages, that interacts with the system and implements the functionality the customer requires.

Not all organizations realize that custom-built embedded code or a website may contain serious vulnerabilities, whose exploitation by an attacker can lead to the leakage of confidential information, as well as backdoors: special sections of code designed to perform operations in response to secret commands known to the code's developer. In addition, custom-made code can contain bugs that can destroy or corrupt databases or disrupt well-established business processes.

Companies that are aware of the risks described above try to involve auditors and source code analysis specialists in the acceptance of finished modules, so that experts can assess the security of the developed solution and make sure it contains no vulnerabilities, errors, or backdoors. But this method of control has a number of disadvantages. First, the service significantly increases the development budget; second, the audit and analysis take a long time, from a week to several months; and third, this approach does not guarantee the complete absence of problems in the analyzed code: there is always a possibility of human error, or of previously unknown attack vectors being discovered after the code has been accepted and put into operation.

There is a secure development methodology, SDL (Security Development Lifecycle), that integrates audit and code control processes into the creation of a software product. However, only the software developer can apply this methodology; for customers SDL is not applicable, since the process involves restructuring how code is created and it is too late to introduce it at acceptance. In addition, many developments affect only a small part of the existing code, in which case SDL is also not applicable.

Source code analyzers exist to solve the problem of auditing source code and to protect against the exploitation of vulnerabilities in embedded code and web applications.

Classification of source code analyzers

Source code analyzers are a class of software products designed to detect and prevent the exploitation of errors in program source code. All products aimed at source code analysis can be roughly divided into three types:

  • The first group includes web application code analyzers and tools for preventing the exploitation of website vulnerabilities.
  • The second group is embeddable code analyzers that allow you to detect problem areas in the source code of modules designed to expand the functionality of corporate and production systems. These modules include programs for the 1C product line, extensions of CRM systems, enterprise management systems and SAP systems.
  • The last group is intended for analyzing source code in various programming languages that is not related to business applications or web applications. Such analyzers are intended for customers and software developers. In particular, this group of analyzers is used to implement secure software development methodologies. Static code analyzers find problems and potential vulnerabilities in source code and provide recommendations for fixing them.

It should be noted that most of the analyzers are of mixed types and perform the functions of analyzing a wide range of software products - web applications, embedded code and conventional software. However, in this review, the emphasis is on the use of analyzers by development customers, so more attention is paid to web application and embedded code analyzers.

Analyzers can employ various analysis mechanisms, but the most common and universal is static source code analysis, SAST (Static Application Security Testing). There are also dynamic analysis methods, DAST (Dynamic Application Security Testing), which check the code during its execution, and various hybrid options that combine different types of analysis. Dynamic analysis is a stand-alone verification method that can extend the capabilities of static analysis or be used independently in cases where access to the source code is not available. This review considers only static analyzers.

Embedded code analyzers and web application analyzers differ in a number of characteristics: not only the quality of analysis and the list of supported software products and programming languages, but also additional mechanisms such as automatic error correction, functions for preventing the exploitation of errors without code changes, the ability to update the built-in database of vulnerabilities and programming errors, certificates of conformity, and the ability to meet the requirements of various regulators.

How source code analyzers work

The general principles of operation are similar for all analyzer classes: both web application source code analyzers and embeddable code analyzers. The difference between these types of products is only in the ability to determine the features of code execution and interaction with the outside world, which is reflected in the analyzer vulnerability databases. Most of the analyzers on the market perform the functions of both classes, checking both code embedded in business applications and web application code equally well.

The input data for the source code analyzer is an array of program source codes and its dependencies (loadable modules, third-party software used, etc.). As a result of the work, all analyzers issue a report on detected vulnerabilities and programming errors, in addition, some analyzers provide functions for automatic error correction.

It should be noted that automatic error correction does not always work correctly, so this functionality is intended only for developers of web applications and embedded modules. A customer should rely only on the analyzer's final report and use that data to decide whether to accept and deploy the developed code or send it back for revision.

Figure 1. Algorithm of the source code analyzer

When evaluating source code, analyzers use various databases containing descriptions of vulnerabilities and programming errors:

  • The vendor's own database of vulnerabilities and programming errors. Each developer of source code analyzers has its own analytics and research departments, which prepare specialized databases for analyzing program source code. The quality of this database is one of the key criteria affecting the overall quality of the product. The database must also be dynamic and constantly updated: new attack vectors and exploitation techniques, as well as changes in programming languages and development methods, require vendors to update the database constantly in order to maintain high-quality checks. Products with a static, non-updatable database most often lose in comparative tests.
  • National databases of programming errors. A number of national vulnerability databases are compiled and maintained by regulators in different countries. For example, the United States uses CWE (Common Weakness Enumeration), maintained by the MITRE organization, which is supported, among others, by the US Department of Defense. Russia does not yet have a similar database, but the FSTEC of Russia plans to supplement its databases of vulnerabilities and threats with a database of programming errors in the future. Vulnerability analyzers implement support for CWE either by embedding it into their own vulnerability database or by using it as a separate verification mechanism.
  • Requirements of standards and recommendations for secure programming. There are both government and industry standards that describe requirements for secure application development, and recommendations and "best practices" from world experts in software development and protection. Unlike CWE, these documents do not describe programming errors directly, but they contain lists of methods that can be converted into checks for a static source code analyzer.

The quality of the analysis, the number of false positives, and the number of missed errors depend directly on the databases used by the analyzer. In addition, analysis for compliance with regulators' requirements can simplify the procedure of an external audit of the infrastructure and information system in cases where such requirements are mandatory. For example, PCI DSS requirements are mandatory for web applications and embedded code that handle bank card payment information, and an external PCI DSS compliance audit includes an analysis of the software products used.

World market

There are many different analyzers on the world market, both from well-known security vendors and from niche players dealing only with this class of products. Gartner has been classifying and evaluating source code analyzers for more than five years; until 2011 Gartner listed the static analyzers discussed in this article separately, later combining them into a higher-level class of application security testing tools.

In the Gartner Magic Quadrant in 2015, HP, Veracode, and IBM are the market leaders in security testing. At the same time, Veracode is the only one of the leading companies that does not have an analyzer as a software product, and the functionality is provided only as a service in the Veracode cloud. The remaining leading companies offer either exclusively products that perform checks on users' computers, or the ability to choose between a product and a cloud service. HP and IBM remain the world market leaders over the past five years, an overview of their products is given below. The product of Checkmarx, which specializes only in this class of products, is closest to the leading positions, so it is also included in the review.

Figure 2. Gartner Magic Quadrant for Application Security Testing Market Players, August 2015

According to a ReportsnReports analyst report, the US market for source code analyzers amounted to $2.5 billion in 2014 and is projected to double to $5 billion by 2019, with annual growth of 14.9%. More than 50% of the organizations surveyed for the report plan to allocate or increase budgets for source code analysis in custom development, and only 3% spoke negatively about the use of these products.

The large number of products in the challengers area confirms the popularity of this class of products and the rapid development of the industry. Over the past five years, the total number of manufacturers in this quadrant has almost tripled, and three products have been added compared to the 2014 report.

Russian market

The Russian market for source code analyzers is quite young: the first public products began to appear less than five years ago. The market formed from two directions: on one side, companies that develop products for the testing performed in the laboratories of the FSTEC, the FSB and the Ministry of Defense of the Russian Federation to identify undeclared capabilities; on the other, companies involved in various areas of security that decided to add a new class of products to their portfolios.

The most notable players in this new market are Positive Technologies, InfoWatch, and Solar Security. Positive Technologies has long specialized in finding and analyzing vulnerabilities; its portfolio includes MaxPatrol, one of the domestic market leaders in external security control, so it is not surprising that the company decided to take up internal analysis and develop its own source code analyzer. InfoWatch grew as a developer of DLP systems and eventually turned into a group of companies looking for new market niches. Appercut joined InfoWatch in 2012, adding a source code analysis tool to the InfoWatch portfolio. InfoWatch's investment and experience made it possible to quickly develop the product to a high level. Solar Security officially presented its Solar inCode product only at the end of October 2015, but by the time of release it already had four official deployments in Russia.

Companies that have been developing source code analyzers for certification testing for decades are generally in no hurry to offer analyzers to business, so our review contains only one such product, from Echelon. Perhaps in the future it will be able to displace other market players, primarily thanks to the developers' extensive theoretical and practical experience in searching for vulnerabilities and undeclared capabilities.

Another niche player on the Russian market is Digital Security, an information security consulting company. With extensive experience in conducting audits and implementing ERP systems, it found an empty niche and took up the development of a product for analyzing the security of ERP systems which, among other functions, contains mechanisms for analyzing the source code of embedded programs.

Brief overview of analyzers

The first source code analysis tool in our review is a product of Fortify, owned by Hewlett-Packard since 2010. The HP Fortify line includes various products for analyzing program code, including the Fortify On-Demand SaaS service, which involves uploading source code to the HP cloud, and the full-fledged HP Fortify Static Code Analyzer application, which is installed on the customer's infrastructure.

HP Fortify Static Code Analyzer supports a wide range of programming languages ​​and platforms, including web applications written in PHP, Python, Java/JSP, ASP.Net, and JavaScript, and embeddable code in ABAP (SAP), Action Script, and VBScript.

Figure 3. HP Fortify Static Code Analyzer Interface

Among the features of the product, it is worth highlighting the presence in HP Fortify Static Code Analyzer of support for integration with various development management systems and error tracking. If the code developer provides the customer with access to direct bug reporting to Bugzilla, HP Quality Center, or Microsoft TFS, the analyzer can automatically generate bug reports on those systems without the need for manual steps.

The operation of the product is based on HP Fortify's own knowledge bases, formed by adapting the CWE base. The product implements an analysis to meet the requirements of DISA STIG, FISMA, PCI DSS and OWASP recommendations.

Among the shortcomings of HP Fortify Static Code Analyzer are the lack of localization for the Russian market (the interface and reports are in English, and there are no materials or documentation in Russian) and the lack of support for analyzing embedded code for 1C and other domestic enterprise-level products.

Benefits of HP Fortify Static Code Analyzer:

  • well-known brand, high quality solution;
  • a large list of analyzed programming languages ​​and supported development environments;
  • the ability to integrate with development management systems and other HP Fortify products;
  • support for international standards, recommendations and “best practices”.

Checkmarx CxSAST is a tool of the American-Israeli company Checkmarx, which specializes in the development of source code analyzers. This product is intended primarily for the analysis of conventional software, but due to the support of the programming languages ​​PHP, Python, JavaScript, Perl and Ruby, it is great for analyzing web applications. Checkmarx CxSAST is a universal analyzer that does not have a pronounced specificity and therefore is suitable for use at any stage of the software product life cycle - from development to application.

Figure 4. Checkmarx CxSAST Interface

Checkmarx CxSAST implements support for the CWE database of program code errors; compliance checks against OWASP and the SANS Top 25 are available, and the PCI DSS, HIPAA, MISRA, FISMA and BSIMM standards are supported. All problems detected by Checkmarx CxSAST are categorized by severity, from minor to critical. One notable feature of the product is code visualization, with flowcharts of execution paths and remediation recommendations linked to the graphical diagram.

The disadvantages of the product include the lack of support for analyzing code embedded in business applications, the lack of localization and the difficulty of using the product for customers of program code, since the solution is intended primarily for developers and is closely integrated with development environments.

Benefits of Checkmarx CxSAST:

  • a large number of supported programming languages;
  • high product speed, with the ability to scan only selected sections of the code;
  • the ability to visualize execution graphs of the analyzed code;
  • visual reports and graphically designed source code metrics.

Another product from a well-known vendor is the IBM Security AppScan Source code analyzer. The AppScan line includes many products related to secure software development, but the other products are not well suited to customers of program code, as they carry a lot of unnecessary functionality. IBM Security AppScan Source, like Checkmarx CxSAST, is primarily designed for development organizations, while supporting even fewer web development languages: only PHP, Perl and JavaScript. Programming languages for code embedded in business applications are not supported.

Figure 5. IBM Security AppScan Source Interface

IBM Security AppScan Source is tightly integrated with the IBM Rational development platform, so the product is most often used during the development and testing phase of software products and is not well suited for acceptance or verification of a custom application.

A notable feature of IBM Security AppScan Source is its support for analyzing programs for IBM Worklight, a platform for mobile business applications. The list of supported standards and requirements is sparse: PCI DSS plus DISA and OWASP recommendations, while detected problems are mapped to CWE.

No particular advantages of this solution for development customers have been identified.

AppChecker from the domestic company CJSC NPO Echelon is a solution that has appeared on the market quite recently. The first version of the product was released only a year ago, but the experience of Echelon in code analysis should be taken into account. "NPO Echelon" is a testing laboratory of FSTEC, FSB and the Ministry of Defense of the Russian Federation and has extensive experience in the field of static and dynamic analysis of program source codes.

Figure 6. Echelon AppChecker Interface

AppChecker is designed to analyze a variety of software and web applications written in PHP, Java and C/C++. It fully supports the CWE vulnerability classification and takes into account OWASP, CERT and NISP recommendations. The product can be used to perform audits for compliance with PCI DSS requirements and the Bank of Russia standard IBBS-2.6-2014.

The disadvantages of the product stem from the early stage of the solution's development: it still lacks support for some popular web development languages and the ability to analyze embedded code.

Advantages:

  • the possibility of conducting an audit according to domestic requirements and PCI DSS;
  • accounting for the specific features of programming languages through flexible configuration of the analyzed projects;
  • low cost.

PT Application Inspector is a product of the Russian developer Positive Technologies, which is distinguished by its approach to solving the problem of source code analysis. PT Application Inspector is primarily aimed at finding vulnerabilities in the code, and not at identifying common software errors.

Unlike all other products in this review, PT Application Inspector has not only the ability to report and demonstrate vulnerabilities, but also the ability to automatically create exploits for certain categories and types of vulnerabilities - small executable modules that exploit the vulnerabilities found. With the help of the created exploits, it is possible in practice to check the danger of the vulnerabilities found, as well as to control the developer by checking the operation of the exploit after the declared closure of the vulnerability.

Figure 7. PT Application Inspector Interface

PT Application Inspector supports both web application development languages ​​(PHP, JavaScript) and embedded code for business applications - SAP ABAP, SAP Java, Oracle EBS Java, Oracle EBS PL/SQL. Also, the PT Application Inspector product supports the visualization of program execution routes.

PT Application Inspector is a one-stop solution for both developers and customers who run custom web applications and business application plug-ins. The database of vulnerabilities and program code errors combines Positive Technologies' own research, the CWE database, and the WASC database (from the Web Application Security Consortium, an analogue of CWE for web applications).

Using PT Application Inspector allows you to meet the requirements of PCI DSS, STO BR IBBS, as well as Order 17 of the FSTEC and the requirement for the absence of undeclared capabilities (relevant for code certification).

Advantages:

  • support for web application analysis and a large set of development systems for business applications;
  • domestic, localized product;
  • a wide range of supported state standards;
  • using the WASC web application vulnerability database and the CWE classifier;
  • the ability to visualize the program code and to search for backdoors.

InfoWatch Appercut was developed by the Russian company InfoWatch. The main difference between this product and the others in this review is its focus on serving customers of business applications.

InfoWatch Appercut supports almost all programming languages used to create web applications (JavaScript, Python, PHP, Ruby) and plug-ins for business applications: 1C, ABAP, X++ (the Microsoft Axapta ERP), Java, Lotus Script. InfoWatch Appercut can be adapted to the specifics of a particular application and to the uniqueness of each company's business processes.

Figure 8. InfoWatch Appercut interface

InfoWatch Appercut supports many requirements for efficient and secure programming, including general PCI DSS and HIPAA requirements, CERT and OWASP recommendations and "best practices", as well as recommendations from business process platform manufacturers: 1C, SAP, Oracle, Microsoft.

Advantages:

  • domestic, localized product certified by the FSTEC of Russia;
  • the only product that supports all popular business platforms in Russia, including 1C, SAP, Oracle EBS, IBM Collaboration Solutions (Lotus) and Microsoft Axapta;
  • a fast scanner that performs checks in seconds and is able to check only modified code and code snippets.

Digital Security ERPScan is a specialized product for analyzing and monitoring the security of business systems built on SAP products; the first version was released in 2010. In addition to modules for analyzing configurations, vulnerabilities and access control (SOD), ERPScan includes a source code security assessment module that searches for backdoors, critical calls, vulnerabilities and programming errors in ABAP and Java code. At the same time, the product takes into account the specifics of the SAP platform, correlates detected code vulnerabilities with configuration settings and access rights, and performs the analysis better than non-specialized products that work with the same programming languages.

Figure 9. Digital Security ERPScan Interface

Additional features of ERPScan include the ability to automatically generate patches for detected vulnerabilities, as well as to generate signatures for possible attacks and upload these signatures to intrusion detection and prevention systems (in partnership with Cisco). In addition, the system contains mechanisms for evaluating the performance of embedded code, which is critical for business applications, since slow operation of additional modules can seriously affect an organization's business processes. The system also supports analysis against specific recommendations for business application code analysis, such as EAS-SEC and BIZEC, as well as the general PCI DSS and OWASP recommendations.

Advantages:

  • deep specialization on one platform of business applications with analysis correlation with configuration settings and access rights;
  • embedded code performance tests;
  • automatic creation of fixes for found vulnerabilities and virtual patches;
  • search for zero-day vulnerabilities.

Solar inCode is a static code analysis tool designed to detect information security vulnerabilities and undeclared features in software source code. The product's main distinguishing feature is its ability to recover application source code from executable files using decompilation (reverse engineering) technology.

Solar inCode allows you to analyze source code written in the Java, Scala, Java for Android, PHP and Objective C programming languages. Unlike most competitors, the list of supported programming languages ​​includes development tools for Android and iOS mobile platforms.

Figure 10. Solar inCode Interface

In cases where the source code is not available, Solar inCode can analyze finished applications; this functionality supports both web applications and mobile applications. In particular, for mobile applications it is enough to copy the link to the application from Google Play or the App Store into the scanner: the application will be automatically downloaded, decompiled and checked.

Using Solar inCode allows you to meet the requirements of PCI DSS, STO BR IBBS, as well as Order 17 of the FSTEC and the requirement for the absence of undeclared capabilities (relevant for code certification).

Advantages:

  • support for analyzing applications for mobile devices running Android and iOS;
  • supports the analysis of web applications and mobile applications without using the source code of programs;
  • gives the results of the analysis in the format of specific recommendations for eliminating vulnerabilities;
  • generates detailed recommendations for setting up protection tools: SIEM, WAF, FW, NGFW;
  • easily integrated into the secure software development process by supporting work with source code repositories.

Conclusions

The presence of bugs, vulnerabilities and backdoors in custom-developed software, whether web applications or plug-ins for business applications, is a serious risk to the security of corporate data. Using source code analyzers can significantly reduce these risks and allows the quality of the developers' work to be controlled without spending additional time and money on the services of experts and external auditors. At the same time, using a source code analyzer usually does not require special training or dedicated staff, and it introduces no other inconvenience if the product is used only for acceptance and the developer performs the error correction. All this makes such tools practically mandatory when commissioning custom development.

When choosing a source code analyzer, one should proceed from the functionality of the products and the quality of their work. First of all, pay attention to whether the product can perform checks for the programming languages in which the code to be checked is written. The next criterion should be the quality of analysis, which can be assessed from the vendor's competence and during trial operation of the product. Another factor may be the ability to audit for compliance with national and international standards, if such compliance is required by corporate business processes.

In this review, the clear leader among foreign products in terms of programming language support and scanning quality is HP Fortify Static Code Analyzer. Checkmarx CxSAST is also a good product, but it can only analyze regular applications and web applications; it has no support for business application plug-ins. The IBM Security AppScan Source solution looks pale in comparison with its competitors and does not stand out in either functionality or quality of checks. However, this product is not aimed at business users; it is intended for development companies, where it can be more effective than its competitors.

Among the Russian products it is difficult to single out a clear leader; the market is represented by three main products: InfoWatch Appercut, PT Application Inspector and Solar inCode. These products differ significantly in technology and are designed for different target audiences. The first supports more business application platforms and is notable for its speed, since it searches for vulnerabilities using exclusively static analysis methods. The second combines static and dynamic analysis, which improves scanning quality but increases the time needed to check the source code. The third is aimed at business users and information security specialists and also allows applications to be checked without access to the source code.

Echelon's AppChecker cannot yet compete on equal terms and offers a limited set of functionality, but given the early stage of the product's development, it may well claim top positions in source code analyzer ratings in the near future.

Digital Security ERPScan is an excellent product for the highly specialized task of analyzing business applications for the SAP platform. By focusing only on this market, Digital Security has developed a functionally unique product that not only analyzes source code but also takes into account the specifics of the SAP platform, specific configuration settings and business application access rights, and can automatically create fixes for discovered vulnerabilities.


Abstract

Static analysis is a way to check the source code of a program for correctness. The static analysis process consists of three steps. First, the analyzed code is divided into lexemes - constants, identifiers, etc. This operation is performed by the lexer. The tokens are then passed to the parser, which builds a code tree based on these tokens. Finally, a static analysis of the constructed tree is performed. This overview article describes three methods of static analysis: code tree traversal analysis, data flow analysis, and data flow analysis with path selection.

Introduction

Testing is an important part of the application development process. There are many different types of testing, including two types related to program code: static analysis and dynamic analysis.

Dynamic analysis is performed on the executable code of the compiled program. In this case, only user-specific behavior is checked, i.e. only the code that is executed during the test. The dynamic analyzer can find memory leaks, measure program performance, get the call stack, etc.

Static analysis allows you to check the source code of a program before it is executed. In particular, any compiler performs static analysis when compiling. However, in large real-world projects, it often becomes necessary to check the entire code for compliance with some additional requirements. These requirements can be very diverse, ranging from variable naming rules to portability (for example, code must run safely on x86 and x64 platforms). The most common requirements are:

  • Reliability - fewer bugs in the program under test.
  • Maintainability - more understandable code that is easy to change and improve.
  • Portability - the flexibility of the program under test when running on different platforms.
  • Readability - reducing the time it takes to understand the code.

Requirements can be broken down into rules and guidelines. Rules, unlike recommendations, are binding. Rules and recommendations are analogous to errors and warnings issued by code analyzers built into standard compilers.

The rules and guidelines, in turn, form the coding standard. This standard defines how a programmer should write program code. Coding standards are used by software development organizations.

The static analyzer finds lines of source code that do not seem to conform to the accepted coding standard and displays diagnostic messages so that the developer can understand the cause of the problem. The process of static analysis is similar to compilation, except that neither object nor executable code is generated. This overview provides a step-by-step description of the static analysis process.

Analysis Process

The static analysis process consists of two main steps: creating a code tree (also called an abstract syntax tree) and analyzing that tree.

In order to analyze source code, the analyzer must first "understand" it, that is, parse it and create a structure that describes the analyzed code in a convenient form. This form is called the code tree. Checking whether the code conforms to the coding standard requires building such a tree.

In the general case, a tree is built only for the analyzed code fragment (for example, for a particular function). To create the tree, the code is first processed by a lexer and then by a parser.

The lexer is responsible for splitting the input data into individual tokens, as well as determining the type of these tokens and passing them sequentially to the parser. The lexer reads the source code line by line and then breaks the resulting lines into reserved words, identifiers, and constants called tokens. After receiving a token, the lexer determines its type.

Consider an example algorithm for determining the type of a lexeme.

If the first character of the token is a digit, the token is considered a number; if that character is a minus sign, it is a negative number. If a token is a number, it can be an integer or a fraction. If the number contains the letter E, which denotes exponential notation, or a decimal point, it is considered a fraction; otherwise it is an integer. Note that this can produce a lexical error: if the parsed source code contains the token "4xyz", the lexer will treat it as the integer 4. This will lead to a syntax error that the parser can detect, although such errors can also be caught by the lexer itself.
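As a rough illustration of the classification step just described, here is a minimal sketch in C; the token_type names and the classify_token function are invented for this example and cover only the cases mentioned above.

#include <ctype.h>
#include <string.h>

typedef enum {
    TOKEN_INTEGER,
    TOKEN_FRACTION,
    TOKEN_STRING,
    TOKEN_IDENTIFIER,
    TOKEN_UNKNOWN       /* reported to the parser as a lexical error */
} token_type;

/* Hypothetical classifier following the algorithm described above. */
static token_type classify_token(const char *lexeme)
{
    size_t i = 0;

    /* A leading minus sign may start a negative number. */
    if (lexeme[0] == '-')
        i = 1;

    if (isdigit((unsigned char)lexeme[i])) {
        /* A digit starts a number; 'E' or '.' anywhere makes it a fraction. */
        if (strchr(lexeme, 'E') != NULL || strchr(lexeme, 'e') != NULL ||
            strchr(lexeme, '.') != NULL)
            return TOKEN_FRACTION;
        return TOKEN_INTEGER;   /* note: "4xyz" also lands here, as described */
    }

    /* Quotes mark a string constant (single or double, language dependent). */
    if (lexeme[0] == '"' || lexeme[0] == '\'')
        return TOKEN_STRING;

    /* Otherwise: an identifier, reserved word or reserved character. */
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
        return TOKEN_IDENTIFIER;

    return TOKEN_UNKNOWN;
}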

If the token is not a number, it can be a string. String constants can be recognized by single quotes, double quotes, or some other character, depending on the syntax of the language being parsed.

Finally, if the token is not a string, it must be an identifier, a reserved word, or a reserved character. If the lexeme does not fit into these categories, a lexical error occurs. The lexer will not handle this error itself - it will only tell the parser that a token of unknown type was encountered. The parser will handle this error.

The parser understands the grammar of the language. It is responsible for detecting syntax errors and for converting a program that does not have such errors into data structures called code trees. These structures, in turn, are fed into the static analyzer and processed by it.

While the lexer only understands the syntax of the language, the parser also understands the context. For example, let's declare a function in C language:

int Func() { return 0; }

The lexer will process this string and parse it into tokens as shown in Table 1:

Table 1. Tokens of the string "int Func() { return 0; }".

The string will be recognized as 8 valid tokens, and these tokens will be passed to the parser.

The parser will look at the context and determine that this set of tokens is the declaration of a function that takes no parameters, returns an integer, and always returns 0.

The parser will figure this out when it creates a code tree from the tokens provided by the lexer and parses that tree. If the tokens and the tree constructed from them are considered correct, this tree will be used in static analysis. Otherwise, the parser will issue an error message.

However, the process of constructing a code tree is not limited to simply representing lexemes as a tree. Let's consider this process in more detail.

Code tree

The code tree represents the essence of the input data in the form of a tree, omitting non-essential syntax details. Such trees differ from concrete syntax trees in that they do not have nodes that represent punctuation marks, such as a semicolon that ends a line, or a comma that is placed between function arguments.

The parsers used to create code trees can be written by hand, or they can be generated by parser generators. Code trees are usually built from the bottom up.

When designing tree nodes, the first decision is usually the level of granularity: whether all language constructs will be represented by nodes of the same type, distinguished by an attribute value, or by distinct node types. As an example, consider the representation of binary arithmetic operations. One option is to use the same node type for all binary operations, with one attribute holding the kind of operation, for example "+". Another option is to use a different node type for each operation; in an object-oriented language these might be classes such as AddBinary, SubtractBinary and MultiplyBinary that inherit from an abstract base class Binary.

As an example, consider two expressions: 1 + 2 * 3 + 4 * 5 and 1 + 2 * (3 + 4) * 5 (see Figure 1).

As can be seen from the figure, the original form of the expression can be restored by traversing the tree from left to right.
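As a minimal sketch of the first design option described above (a single node type tagged with the kind of operation), the following C fragment builds the tree for the expression 1 + 2 * 3 and restores a fully parenthesized form of it by a left-to-right traversal. All names are illustrative and not taken from any particular analyzer.

#include <stdio.h>
#include <stdlib.h>

/* One node type for all constructs, distinguished by 'kind'. */
typedef struct node {
    char kind;                 /* '+', '*', or 'n' for a number literal */
    int value;                 /* used only when kind == 'n'            */
    struct node *left, *right; /* operands of a binary operator         */
} node;

static node *leaf(int v)
{
    node *n = malloc(sizeof *n);
    n->kind = 'n'; n->value = v; n->left = n->right = NULL;
    return n;
}

static node *binary(char op, node *l, node *r)
{
    node *n = malloc(sizeof *n);
    n->kind = op; n->value = 0; n->left = l; n->right = r;
    return n;
}

/* Left-to-right traversal: prints a fully parenthesized form from which
   the original expression can be recovered. */
static void print_expr(const node *n)
{
    if (n->kind == 'n') {
        printf("%d", n->value);
        return;
    }
    printf("(");
    print_expr(n->left);
    printf(" %c ", n->kind);
    print_expr(n->right);
    printf(")");
}

int main(void)
{
    /* Tree for 1 + 2 * 3: '*' binds tighter, so it becomes a subtree of '+'. */
    node *tree = binary('+', leaf(1), binary('*', leaf(2), leaf(3)));
    print_expr(tree);          /* prints (1 + (2 * 3)) */
    printf("\n");
    return 0;
}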

After the code tree is generated and checked, the static analyzer can determine if the source code follows the rules and guidelines specified in the coding standard.

Static Analysis Methods

There are many different methods, in particular code tree traversal analysis, data flow analysis, data flow analysis with path selection, and others. The specific implementations of these methods differ from analyzer to analyzer. However, static analyzers for different programming languages can share the same underlying code (infrastructure). Such frameworks contain a set of basic algorithms that can be used in different code analyzers regardless of the specific tasks and the language being analyzed. The set of supported methods and the specific implementation of those methods again depend on the particular framework. For example, a framework may make it easy to create an analyzer that uses code tree traversal but may not support data flow analysis.

Although all three methods of static analysis listed above use a code tree built by a parser, these methods differ in their tasks and algorithms.

Tree traversal analysis, as the name implies, is performed by traversing a code tree and performing checks to see if the code conforms to an accepted coding standard, specified as a set of rules and guidelines. This is the type of analysis that compilers do.

Data flow analysis can be described as the process of gathering information about the usage, definition, and dependencies of data in the program being analyzed. Data flow analysis uses a control flow graph generated from the code tree. This graph represents all possible execution paths of the program: vertices denote straight-line code fragments without any jumps, and edges represent possible transfers of control between these fragments. Since the analysis is performed without running the program under test, it is impossible to determine the exact result of its execution; in other words, it is impossible to know exactly which path control will take. Therefore, data flow analysis algorithms approximate the possible behavior, for example by considering both branches of an if-then-else statement or by analyzing the body of a while loop with a certain precision. A limit on precision always exists, since the data flow equations are written for some set of variables, and the number of those variables must be finite, because we only consider programs with a finite set of statements. Hence there is always an upper bound on the number of unknowns, which bounds the accuracy. From the point of view of the control flow graph, static analysis treats all possible execution paths of the program as feasible. Because of this assumption, data flow analysis can provide only approximate solutions for a limited set of problems.
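A small hypothetical C fragment shows why this approximation matters: because the analysis treats both branches as possible, it must merge the facts collected on each incoming path.

int choose(int flag)
{
    int x;            /* x is not initialized here                        */

    if (flag)
        x = 1;        /* path through the then-branch: x is defined       */
                      /* path that skips the branch: x is still undefined */

    /* Data flow analysis merges the facts from both incoming paths and
       concludes that x "may be uninitialized" here, so it can report a
       possible use of an uninitialized variable without running the code. */
    return x;
}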

The data flow analysis algorithm described above does not distinguish between paths: all possible paths, whether or not they are feasible and whether they are executed often or rarely, contribute to the solution. In practice, however, only a small fraction of the potential paths are actually executed, and the most frequently executed code tends to be an even smaller subset of all possible paths. It is therefore logical to shrink the analyzed control flow graph, and thus reduce the amount of computation, by analyzing only a subset of the possible paths. Path-selective analysis is performed on a reduced control flow graph from which infeasible paths and paths containing no "dangerous" code have been removed. The criteria for selecting paths differ between analyzers; for example, an analyzer may consider only paths containing dynamic array declarations, treating such declarations as "dangerous" according to its settings.
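The following contrived C fragment (purely illustrative) shows an infeasible path that a path-selective analysis can prune from the reduced graph.

void example(int x)
{
    int y;

    if (x > 0)
        y = 1;
    else
        y = 0;

    /* The path on which x > 0 holds here while y == 0 was assigned above is
       infeasible: no real execution can follow it.  A path-selective analysis
       removes such paths from the reduced control flow graph and does not
       spend time analyzing the block below along them. */
    if (x > 0 && y == 0) {
        /* unreachable in any real execution */
    }
}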

Conclusion

The number of static analysis methods, and of the analyzers themselves, grows from year to year, which reflects growing interest in static code analysis. The reason for this interest is that the software being developed is becoming ever more complex, making it impossible to check the code manually.

In this article, a brief description of the static analysis process and various methods for conducting such an analysis was given.


With static analysis, many different defects and weaknesses can be found in source code even before the code is ready to run. Dynamic, or run-time, analysis, on the other hand, takes place on running software and detects problems as they occur, usually using sophisticated instrumentation. One might argue that one form of analysis precedes the other, but developers can combine both methods to speed up development and testing processes and to improve the quality of the delivered product.

This article first considers static analysis. It can be used to prevent problems before they reach the main codebase and to ensure that new code conforms to the standard. Using various analysis techniques, such as abstract syntax tree (AST) checking and code path analysis, static analysis tools can uncover hidden vulnerabilities, logical errors, implementation defects, and other problems, both at the development stage at each workstation and during system builds. The rest of the article then explores dynamic analysis, which can be used during module development and system integration to identify problems missed by static analysis. Dynamic analysis not only detects errors related to pointers and other incorrect behavior, but also makes it possible to optimize the use of CPU cycles, RAM, flash memory and other resources.

The article also discusses options for combining static and dynamic analysis, which helps prevent falling back to earlier stages of development as the product matures. This two-pronged approach helps avoid most problems early in development, when they are easiest and cheapest to fix.

Combining the best of both worlds

Static analysis tools find bugs early in a project, usually before the executable is created. This early detection is especially useful for large embedded systems projects where developers cannot use dynamic analysis tools until the software is complete enough to run on the target system.

At the static analysis stage, areas of source code with weaknesses are discovered and described, including hidden vulnerabilities, logical errors, implementation defects, incorrect handling of parallel operations, rarely occurring boundary conditions, and many other problems. For example, the Klocwork Insight static analysis tools perform deep analysis of source code at the syntactic and semantic levels. These tools also perform sophisticated interprocedural analysis of control and data flows, use advanced techniques for pruning false paths, evaluate the values that variables can take, and model the potential run-time behavior of the program.

Developers can use static analysis tools at any time during the development phase, even when only fragments of the project have been written. However, the more complete the code, the better. With static analysis, all potential code execution paths can be viewed - this rarely happens in normal testing, unless the project requires 100% code coverage. For example, static analysis can detect programming errors associated with edge conditions or path errors not tested at design time.

Since static analysis attempts to predict the behavior of a program based on a source code model, sometimes a "bug" is found that does not actually exist - this is the so-called "false positive" (false positive). Many modern static analysis tools implement improved techniques to avoid this problem and perform exceptionally accurate analysis.

Arguments for static analysis:

  • Used early in the software life cycle, before the code is ready to run and before testing begins.
  • Existing code bases that have already been tested can be analyzed.
  • The tools can be integrated into the development environment, both as part of the "nightly build" process and as part of the developer's workbench toolkit.
  • Low cost: there is no need to create test programs or stubs, and developers can run their own analyses.

Arguments against static analysis:

  • Software bugs and vulnerabilities may be discovered that do not necessarily lead to a program failure or affect the program's behavior during actual execution.
  • There is a non-zero probability of "false positives".

Table 1. Arguments for and against static analysis.

Dynamic analysis tools detect programming errors in the code that is being executed. At the same time, the developer has the opportunity to observe or diagnose the behavior of the application during its execution, in the ideal case, directly in the target environment.

In many cases, a dynamic analysis tool modifies the source or binary code of the application to install traps, or hooks, for instrumentation. These hooks can be used to detect program errors at run time, analyze memory usage, measure code coverage, and check other conditions. Dynamic analysis tools can generate precise information about the state of the stack, which allows debuggers to find the cause of an error. Therefore, when dynamic analysis tools find a bug, it is most likely a real bug that the programmer can quickly identify and fix. Note, however, that to produce an error situation at run time, exactly the conditions under which the program error occurs must be created. Accordingly, developers must create test cases that implement the particular scenarios.

Arguments for dynamic analysis:

  • "False positives" are rare, so the tools are efficient at finding real errors.
  • A full stack and run-time trace can be produced to track down the cause of an error.
  • Errors are captured in the context of a running system, both in a real environment and in simulation mode.

Arguments against dynamic analysis:

  • The tool intervenes in the real-time behavior of the system; the degree of intervention depends on the number of instrumentation inserts used. This does not always cause problems, but it must be kept in mind when working with time-critical code.
  • The completeness of the analysis depends on the degree of code coverage: the code path containing the error must actually be executed, and the test case must create the conditions needed to trigger the error situation.

Table 2. Arguments for and against dynamic analysis.

Early error detection to reduce development costs

The sooner a software error is detected, the faster and cheaper it can be fixed. Therefore, static and dynamic analysis tools are of real value in finding bugs early in the software life cycle. Various industry studies show that correcting a problem at the system testing (QA) stage or after delivery turns out to be orders of magnitude more expensive than fixing the same problem during development. Many organizations have their own estimates of the cost of fixing defects. Figure 1 shows data on this problem taken from the oft-cited book by Capers Jones, "Applied Software Measurement".

Figure 1. As the project progresses, the cost of fixing software defects can increase exponentially. Static and dynamic analysis tools help avoid these costs by detecting bugs early in the software life cycle.

Static Analysis

Static analysis has been in software development practice for almost as long as software development itself has existed. In its original form, the analysis was reduced to monitoring compliance with programming style standards (lint). Developers used it directly at their workplace. When it came to detecting bugs, early static analysis tools focused on what was on the surface: programming style and common syntax errors. For example, even the simplest static analysis tools can detect an error like this:

int foo(int x, int* ptr)
{
    if(x & 1);
    {
        *ptr = x;
        return;
    }
    ...
}

Here, the erroneous extra semicolon leads to potentially disastrous results: the value pointed to by the function's input parameter is overwritten under unexpected conditions. The assignment through the pointer always executes, regardless of the condition being checked.

Early analysis tools focused mainly on syntax errors, so while they could find serious bugs, most of the problems they found were relatively trivial. In addition, the tools were given too little code context to produce accurate results, because the work was done within a typical compile/link development cycle and the developer was working on just a small piece of code in a large software system. This shortcoming forced the analysis tools to rely on estimates and hypotheses about what might happen outside the developer's "sandbox", which in turn led to an inflated volume of reports with "false positives".

Subsequent generations of static analysis tools addressed these shortcomings and went beyond syntactic and semantic analysis. The new tools built an extended representation or model of the code (something similar to a compilation phase) and then modeled all possible execution paths against that model. Logical flows were then mapped onto these paths, with simultaneous tracking of how and where data objects are created, used and destroyed. When analyzing program modules, procedures for interprocedural control and data flow analysis can be included. False positives are minimized by new approaches: pruning false paths, evaluating the values that variables can take, and modeling potential run-time behavior. Generating data of this quality requires static analysis tools to analyze the project's entire code base and perform an integrated, system-level build, rather than just working with the results obtained in the "sandbox" on the developer's desktop.

To perform these sophisticated forms of analysis, static analysis tools deal with two main types of code checks:

  • Abstract syntax tree checking - to check the basic syntax and structure of the code.
  • Code path analysis - to perform a more complete analysis, which depends on understanding the state of the program's data objects at a particular point on the code execution path.

Abstract syntax trees

An abstract syntax tree is simply a tree structure representation of source code, as might be generated in the pre-compiler steps. The tree contains a detailed one-to-one decomposition of the code structure, enabling tools to perform a simple search for anomalous syntax points.

It is very easy to build a checker that checks for standards regarding naming conventions and function call restrictions, such as checking for unsafe libraries. The purpose of performing AST checks is usually to draw some kind of inference from the code itself without using knowledge of how the code behaves during execution.

Many tools offer AST-based checks for a variety of languages, including open source tools such as PMD for Java. Some tools use an XPath grammar, or a grammar derived from XPath, to define the conditions the checkers look for. Other tools provide advanced mechanisms that enable users to create their own AST-based checkers. This type of check is relatively easy to perform, and many organizations create new checkers of this type to verify compliance with corporate coding standards or industry-recommended best practices.
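As an illustration only (this is not the API of PMD or any other specific tool), the sketch below shows the idea of an AST-based check in C: it walks a simplified node structure and flags calls to functions that a coding standard might forbid, such as gets() or strcpy().

#include <stdio.h>
#include <string.h>

/* A deliberately simplified AST node; real tools use far richer models. */
typedef struct ast_node {
    const char *kind;            /* e.g. "FunctionCall", "Identifier", ... */
    const char *name;            /* callee name for "FunctionCall" nodes   */
    struct ast_node **children;  /* NULL-terminated array of child nodes   */
} ast_node;

/* AST-based check: visit every node and report calls to forbidden functions. */
static void check_unsafe_calls(const ast_node *node)
{
    if (node == NULL)
        return;

    if (strcmp(node->kind, "FunctionCall") == 0 && node->name != NULL &&
        (strcmp(node->name, "gets") == 0 || strcmp(node->name, "strcpy") == 0)) {
        printf("warning: call to unsafe function '%s'\n", node->name);
    }

    if (node->children != NULL) {
        for (ast_node **child = node->children; *child != NULL; ++child)
            check_unsafe_calls(*child);
    }
}

Because such a check needs nothing but the tree itself, it can run on any code that parses, which is exactly what makes AST-based checkers cheap to write.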

Code path analysis

Let's look at a more complex example. Now, instead of looking for cases of programming style violations, we want to check whether the attempted pointer dereference will work correctly or fail:

if(x & 1)
    ptr = NULL;
*ptr = 1;

A superficial examination of the fragment leads to the obvious conclusion that the variable ptr can be NULL if the variable x is odd, and that the subsequent dereference will then inevitably access the null page. However, for a checker built on the AST alone, finding such an error is very problematic. Consider the AST (simplified for clarity) that would be generated for the code fragment above:

Statement Block
  If-statement
    Check-Expression
      Binary-operator &
        x
        1
    True-Branch
      Expression-statement
        Assignment-operator =
          ptr
          0
  Expression-statement
    Assignment-operator =
      Dereference-pointer
        ptr
      1

In such cases, no tree search or simple enumeration of nodes can detect, in any reasonably general way, that an attempt is being made (and is at least sometimes invalid) to dereference the pointer ptr. Accordingly, the analysis tool cannot simply search the syntax model. It must also analyze the life cycle of data objects as they appear and are used within the control logic during execution.

Code path analysis traces objects along execution paths so that checkers can determine whether the data is used accurately and correctly. The use of code path analysis expands the range of questions that static analysis can answer. Instead of simply checking that a program's code is well formed, code path analysis attempts to determine the "intentions" of the code and to check whether the code is written in accordance with those intentions. This can provide answers to questions such as the following (a small illustration follows the list):

  • Was the newly created object freed before all references to it were removed from scope?
  • Has the allowed range of values ​​been checked for some data object before the object is passed to the OS function?
  • Was the character string checked for special characters before passing the string as a SQL query?
  • Will the copy operation cause a buffer overflow?
  • Is it safe to call this function at this time?

By analyzing code execution paths in this way, both forward, from the triggering event to the target scenario, and backward, from the triggering event to the required data initialization, the tool can answer these questions and issue an error report if the target scenario or the required initialization does not occur as expected.

The implementation of such a capability is essential for performing advanced source code analysis. Therefore, developers should look for tools that use advanced code path analysis to detect memory leaks, invalid pointer dereferences, unsafe or invalid data transfers, concurrency violations, and many other problematic conditions.
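For instance, here is a small sketch (not taken from any particular tool's documentation) of the path-dependent defects such analysis is meant to find: the leak and the possible NULL dereference below exist only on particular execution paths, so no purely syntactic check can see them.

#include <stdlib.h>
#include <string.h>

int save_record(const char *data, int use_cache)
{
    char *buf = malloc(64);
    if (data == NULL)
        return -1;              /* path 1: buf is leaked here                    */
    if (!use_cache)
        buf = NULL;             /* path 2: the allocated buffer is leaked        */
    strcpy(buf, data);          /* path 2: NULL dereference; any path: overflow
                                   if data is longer than 63 characters          */
    free(buf);
    return 0;
}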

Sequence of actions when performing static analysis

Static analysis can detect problems at two key points in the development process: while the program is being written at the developer's workstation and at the system linking stage. As already mentioned, the current generation of tools works mainly at the system linking stage, when the code of the entire system can be analyzed as a whole, which leads to very accurate diagnostic results.

Unique in its kind, Klocwork Insight allows you to analyze the code created at the workplace of a particular developer, while avoiding the problems associated with inaccurate diagnostics, which is usually characteristic of tools of this kind. Klocwork provides Connected Desktop Analysis, which analyzes a developer's code with an understanding of all system dependencies. This results in local analysis that is just as accurate and powerful as centralized system analysis, but all before the code is fully assembled.

From an analysis sequencing perspective, this capability allows the developer to perform accurate, high-quality static analysis very early in the development life cycle. Klocwork Insight reports all issues to the developer's integrated development environment (IDE) or to the command line as the developer writes code and periodically compiles and links. These messages and reports are issued before dynamic analysis is performed and before all developers merge their code.

Fig. 2 – Sequence of static analysis execution.

Dynamic Analysis Technology

To detect programming errors, dynamic analysis tools often insert small fragments of code either directly into the program's source code (source code instrumentation) or into the executable code (object code instrumentation). These code segments perform a "sanity check" of the program state and report an error if something incorrect or inoperable is found. Such tools may provide other functions as well, such as tracking memory allocation and usage over time.

Dynamic analysis technology includes:

  • Placement of inserts in the source code at the stage of preprocessing– a special code fragment is inserted into the source code of the application before compilation to detect errors. This approach does not require detailed knowledge of the runtime environment, and as a result, this method is popular among embedded systems testing and analysis tools. An example of such a tool is the IBM Rational Test RealTime product.
  • Placing Inserts in Object Code- For such a dynamic analysis tool, you must have sufficient knowledge of the runtime environment to be able to insert code directly into executable files and libraries. With this approach, you do not need to access the source code of the program or relink the application. An example of such a tool is IBM Rational Purify.
  • Inserting code at compile time – the developer uses special compiler switches (options) to have checking code inserted during compilation, relying on the compiler's own ability to detect errors. For example, the GNU C/C++ 4.x compiler uses Mudflap technology to detect problems with pointer operations.
  • Specialized Runtime Libraries– to detect errors in the passed parameters, the developer uses debug versions of system libraries. Functions like strcpy() are infamous because of the possibility of null or erroneous pointers at runtime. When using debug versions of libraries, such "bad" parameters are detected. This technology does not require relinking of the application and affects performance to a lesser extent than the full use of inserts in the source/object code. This technology is used in the RAM analysis tool in the QNX® Momentics® IDE.

In this article, we'll look at the technologies used in the QNX Momentics developer tools, with a particular focus on GCC Mudflap and specialized runtime libraries.

GNU C/C++ Mudflap: compile-time injection into source code

The Mudflap tool, present in version 4.x of the GNU C/C++ compiler (GCC), uses compile-time injection into the source code: at run time, the inserted code checks constructs that are potential sources of errors. Mudflap focuses on pointer operations, since these are the source of many run-time errors in programs written in C and C++.

With Mudflap enabled, the GCC compiler performs an extra pass in which it inserts verification code for pointer operations. The inserted code typically validates the values of the pointers being used; invalid pointer values cause GCC to issue messages to the standard error stream (stderr). Mudflap's pointer checking does more than test pointers for null: its database stores the memory addresses of live objects and object properties such as the location in the source code, a date/time stamp, and stack backtraces from when the memory was allocated and deallocated. This database makes it possible to quickly obtain the data needed when analyzing memory access operations in the program's source code.

Library functions like strcpy() don't check passed parameters. Such functions are not tested by Mudflap either. However, in Mudflap it is possible to create a symbol wrapper for statically linked libraries or an insert for dynamic libraries. With this technology, an additional layer is created between the application and the library, which makes it possible to check the validity of the parameters and issue a message about the occurrence of deviations. Mudflap uses a heuristic algorithm based on knowledge of the memory boundaries used by the application (heap, stack, code and data segments, and so on) to determine whether the returned pointer values ​​are valid.

Using GCC command-line options, the developer can enable Mudflap features for inserting code fragments and controlling their behavior, such as how violations (of bounds or values) are handled, additional checks and settings, heuristics, and self-diagnostics. For example, the -fmudflap switch sets the default Mudflap configuration. Violations found by Mudflap are reported to the standard error stream (stderr) or to the command line. The verbose output provides information about the violation, the variables and functions involved, and the location of the code. This information can be imported automatically into the IDE, where it is visualized together with a stack trace. Using this data, the developer can quickly jump to the corresponding place in the program's source code.
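As a rough sketch of how this looks in practice (the file name is arbitrary and the exact text of the violation report varies by GCC version), a small program with an out-of-bounds write can be built with Mudflap instrumentation, and the violation is then reported on stderr at run time:

/* mf_demo.c
   Build with GCC 4.x:  gcc -g -fmudflap mf_demo.c -o mf_demo -lmudflap
   Running ./mf_demo makes Mudflap print a violation report to stderr
   for the out-of-bounds write below. */
#include <stdlib.h>

int main(void)
{
    int *a = malloc(10 * sizeof(int));
    a[10] = 1;                  /* one element past the end of the allocation */
    free(a);
    return 0;
}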

Fig. 3 shows an example of how an error is presented in the IDE, along with the corresponding backtrace information. The backtrace works as a link to the source code, allowing the developer to quickly diagnose the cause of the problem.

Using Mudflap may increase link time and may decrease performance at run time. The data presented in the article “Mudflap: Pointer Use Checking for C/C++” indicates that with Mudflap enabled, the link time increases by 3…5 times, and the program starts running 1.25 to 5 times slower. It is clear that developers of time-critical applications should use this feature with caution. However, Mudflap is a powerful tool for identifying error-prone and potentially fatal code constructs. QNX plans to use the Mudflap tool in future versions of their dynamic analysis tools.

Fig. 3 – Using the backtrace information displayed in the QNX Momentics IDE to find the source code that caused the error.

Debug versions of runtime libraries

As an alternative to special debug inserts in the code, which incur significant additional memory and time costs at link time and at run time, developers can use pre-instrumented runtime libraries. In such libraries, code is added around the function calls to check the validity of the input parameters. For example, consider an old friend, the string copy function:

strcpy(a,b);

It takes two parameters, both pointers to char: one for the source string (b) and one for the destination string (a). Despite its simplicity, this function can be a source of many errors:

  • if the value of pointer a is null or invalid, copying to that destination will cause a memory access violation;
  • if the value of pointer b is null or invalid, reading from that address will cause a memory access violation;
  • if the terminating character '\0' is missing at the end of string b, more characters than expected will be copied to the destination string;
  • if string b is larger than the memory allocated for string a, more bytes than expected will be written to the destination address (a typical buffer overflow scenario).

The debug version of the library checks the values of parameters a and b. The string lengths are also checked to make sure they are compatible. If an invalid parameter is found, an appropriate alarm message is issued. In the QNX Momentics environment, these error messages are imported from the target system and displayed on the screen. The QNX Momentics environment also uses memory allocation and deallocation tracking technology to enable deep analysis of RAM usage.
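To make the idea concrete, here is a minimal sketch of what such a checked wrapper might look like. It is an illustration only, not the actual QNX library code; dbg_strcpy, report_error and MAX_CHECK_LEN are hypothetical names invented for this sketch.

#include <stdio.h>
#include <string.h>

#define MAX_CHECK_LEN 4096      /* arbitrary sanity limit for this sketch */

static void report_error(const char *msg)
{
    fprintf(stderr, "libc-debug: %s\n", msg);
}

char *dbg_strcpy(char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        report_error("strcpy(): NULL pointer argument");
        return a;
    }
    if (memchr(b, '\0', MAX_CHECK_LEN) == NULL)
        report_error("strcpy(): source string is unterminated or suspiciously long");
    return strcpy(a, b);
}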

The debug version of the library will work with any application that uses its functions; you don't need to make any additional changes to the code. Moreover, the developer can add the library during application startup. The library will then replace the corresponding parts of the full standard library, eliminating the need to use a debug version of the full library. In the QNX Momentics IDE, a developer can add such a library at program startup as part of a normal interactive debug session. Fig. 4 shows an example of how QNX Momentics detects and reports memory errors.

The debug versions of the libraries provide a proven "non-aggressive" method for detecting errors when calling library functions. This technique is ideal for RAM analysis and other analysis methods that depend on matched pairs of calls, such as malloc() and free(). In other words, this technology can only detect run-time errors for code with library calls. It does not detect many typical errors, such as inline pointer dereferences or incorrect pointer arithmetic. Typically, when debugging, only a subset of the system calls are monitored. You can learn more about this in the article.
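For example, a trivial sketch of the kind of mismatched-pair error that an instrumented allocator is designed to catch at the second call:

#include <stdlib.h>

int main(void)
{
    char *p = malloc(16);
    free(p);
    free(p);        /* double free: the debug allocator flags the second call */
    return 0;
}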

Fig. 4 – RAM analysis is performed by placing traps on the API calls related to memory access.

Sequence of actions in dynamic analysis

In short, dynamic analysis involves capturing violation events or other significant events in the embedded target system, importing this information into the development environment, and then using visualization tools to quickly identify buggy sections of code.

As shown in Fig. 5, dynamic analysis not only allows you to detect errors, but also helps to draw the attention of the developer to the details of the consumption of memory, CPU cycles, disk space and other resources. The analysis process consists of several steps, and a good dynamic analysis tool provides strong support for each step:

  1. Observation – first of all, runtime errors are captured, memory leaks are detected, and all results are displayed in the IDE.
  2. Correction – the developer can then trace each error back to the offending source line. With good integration into the IDE, every error is displayed on the screen; the developer simply clicks on the error line and the source code fragment opens at the line that breaks the program. In many cases, the developer can quickly fix the problem using the available stack trace and the IDE's other source code tools (function call viewers, call tracers, and so on).
  3. Profiling – having fixed the detected bugs and memory leaks, the developer can analyze resource usage over time, including peaks, load averages, and resource overruns. Ideally, the analysis tool provides a visual representation of long-term resource usage, allowing spikes in memory allocation and other anomalies to be spotted immediately.
  4. Optimization – using the information from the profiling stage, the developer can now perform a "fine-grained" analysis of the program's resource usage. Among other things, such optimization can minimize resource peaks and overheads, including RAM and CPU usage.

Fig. 5 – Typical sequence of actions for dynamic analysis.

Combining the sequence of actions of various types of analysis in the development environment

Each of the static and dynamic analysis tools has its own strengths, so development teams should use them in tandem. For example, static analysis tools can detect errors that dynamic analysis tools miss, because dynamic analysis tools only catch an error if the faulty piece of code is actually executed during testing. On the other hand, dynamic analysis tools detect software errors in the final, running program, and there is hardly any arguing about an error report once the dereference of a null pointer has actually been observed.

Ideally, a developer will use both analysis tools in their daily work. The task is greatly facilitated if the tools are well integrated into the development environment in the workplace.

Here is an example of the joint use of two types of tools:

  1. At the beginning of the working day, the developer views the report on the results of the nightly build. This report includes both the build errors themselves and the results of the static analysis performed during the build.
  2. The static analysis report lists the detected defects along with information that will help you fix them, including links to the source code. Using the IDE, the developer can flag each situation as either a true bug or a "false positive". After that, the actual errors present are corrected.
  3. The developer saves the changes made locally, within the IDE, along with any new code snippets. The developer does not commit these changes back to the source control system until the changes have been reviewed and tested.
  4. The developer analyzes and corrects the new code using a static analysis tool in the local workplace. In order to be sure of high-quality error detection and the absence of "false positives", the analysis uses extended information at the system level. This information is taken from the nightly build/analysis process.
  5. After analyzing and "cleaning up" any new code, the developer builds the code into a local test image or executable.
  6. Using dynamic analysis tools, the developer runs tests to verify the changes made.
  7. With the help of the IDE, the developer can quickly identify and fix bugs that are reported through the dynamic analysis tools. The code is considered final and ready for use when it has gone through static analysis, unit testing, and dynamic analysis.
  8. The developer submits changes to the source control system; after that, the modified code participates in the subsequent nightly linking process.

This workflow is similar to that of medium to large projects that already use nightly builds, source code control, and code ownership. Because the tools are integrated into the IDE, developers can quickly perform static and dynamic analysis without deviating from the typical workflow. As a result, the quality of the code increases significantly already at the stage of source code creation.

The role of the RTOS architecture

Within the framework of the discussion about static and dynamic analysis tools, the mention of the RTOS architecture may seem inappropriate. But it turns out that a well-built RTOS can greatly facilitate the detection, localization, and resolution of many software bugs.

For example, in a microkernel RTOS such as QNX Neutrino, all applications, device drivers, file systems, and networking stacks reside outside the kernel in separate address spaces. As a result, they are all isolated from the kernel and from each other. This approach provides the highest degree of failure localization: the failure of one component does not bring down the system as a whole. Moreover, it becomes easy to trace a RAM-related error or other logical error to the exact component that caused it.

For example, if a device driver attempts to access memory outside of its process container, then the OS can identify the process, indicate the location of the error, and generate a dump file that can be viewed by source code debuggers. At the same time, the rest of the system will continue to work, and the developer can localize the problem and work on fixing it.

Fig. 6 – In a microkernel OS, failures in RAM for drivers, protocol stacks, and other services will not lead to disruption of other processes or the kernel. Moreover, the OS can instantly detect an unauthorized attempt to access memory and indicate from which code this attempt was made.

Compared to a conventional OS kernel, the microkernel has an unusually short Mean Time to Repair (MTTR) after a failure. Consider what happens when a device driver crashes: the OS can shut down the driver, restore the resources used by the driver, and restart the driver. This usually takes a few milliseconds. In a typical monolithic operating system, the device must be rebooted - this process can take from a few seconds to several minutes.

Final remarks

Static analysis tools can detect programming errors even before the code is executed. They even find errors that are missed during unit testing, system testing, and integration, because providing full code coverage for complex applications is very difficult and costly. In addition, development teams can run static analysis during regular system builds to ensure that every piece of new code is analyzed.

Meanwhile, dynamic analysis tools support the integration and testing phases by reporting errors (or potential problems) that occur during program execution to the development environment. These tools also provide complete tracing back to where the error occurred. Using this information, developers can perform post-mortem debugging of mysterious program failures or system crashes in much less time. Dynamic analysis through stack traces and variables can reveal the underlying causes of the problem - this is better than using “if (ptr != NULL)” statements all over the place to prevent and work around crashes.

The use of early detection, better and complete code test coverage, along with error correction helps developers create better quality software in a shorter time frame.

Bibliography

  • Eigler, Frank Ch., “Mudflap: Pointer Use Checking for C/C++”, Proceedings of the GCC Developers Summit 2003, pg. 57-70. http://www.linux.org.uk/~ajh/gcc/gccsummit-2003-proceedings.pdf
  • “Heap Analysis: Making Memory Errors a Thing of the Past”, QNX Neutrino RTOS Programmer’s Guide. http://pegasus.ott.qnx.com/download/download/16853/neutrino_prog.pdf

About QNX Software Systems

QNX Software Systems is a subsidiary of Harman International and a leading global provider of innovative technologies for embedded systems, including middleware, development tools, and operating systems. The QNX® Neutrino® RTOS, QNX Momentics® Development Kit, and QNX Aviage® middleware, based on a modular architecture, form the most robust and scalable software suite for building high-performance embedded systems. Leading global companies such as Cisco, Daimler, General Electric, Lockheed Martin, and Siemens are widely using QNX technologies in network routers, medical devices, vehicle telematics units, safety and security systems, industrial robots, and other critical and mission-critical applications. The company's head office is located in Ottawa (Canada), and product distributors are located in more than 100 countries around the world.

About Klocwork

Klocwork's products are designed for automated static code analysis and for the detection and prevention of software defects and security issues. Our products give development teams the tools to identify the root causes of software quality and security deficiencies, and to track and prevent these deficiencies throughout the development process. Klocwork's patented technology was created in 1996 and has delivered a high return on investment (ROI) for more than 80 clients, many of them Fortune 500 companies running some of the world's most demanding software development environments. Klocwork is a privately held company with offices in Burlington, San Jose, Chicago, Dallas (USA) and Ottawa (Canada).

Analyzing binary code, that is, code that is executed directly by the machine, is not a trivial task. In most cases, when a binary needs to be analyzed, it is first recovered by disassembly, then decompiled into some high-level representation, and the result is then analyzed.

It must be said that the recovered code, in terms of its textual representation, has little in common with the code that was originally written by the programmer and compiled into the executable file. It is impossible to exactly reconstruct a binary produced from compiled languages such as C/C++ or Fortran, since this is an algorithmically unformalized task: in converting the source code written by the programmer into the program the machine executes, the compiler performs irreversible transformations.

In the 90s of the last century, it was widely believed that the compiler, like a meat grinder, grinds the source program, and the task of restoring it is similar to the task of restoring a sheep from a sausage.

However, it's not all bad. In the process of making the sausage, the sheep loses its functionality, whereas a binary program retains it. If the resulting sausage could run and jump, the two tasks would be similar.

So, since the binary program retains its functionality, it is possible to recover a high-level representation from the executable code such that the binary program, whose original representation no longer exists, and the program whose textual representation we have obtained are functionally equivalent.

By definition, two programs are functionally equivalent if, given the same input, both terminate or fail to terminate their execution and, if execution terminates, produce the same result.

The task of disassembly is usually solved in semi-automatic mode: a specialist performs the recovery manually using interactive tools, for example the IdaPro interactive disassembler, radare, or another tool. Decompilation is likewise performed in semi-automatic mode, with the specialist assisted by a decompiler such as HexRays, SmartDecompiler, or another decompiler suitable for the task at hand.

For interpreted languages such as Java, or the languages of the .NET family, which are translated into byte-code, the decompilation task is different: the original textual representation of the program can be recovered from byte-code quite accurately. We do not consider that case in this article.

So, binary programs can be analyzed by means of decompilation. Typically, such an analysis is performed to understand the behavior of the program in order to replace or modify it.

From the practice of working with legacy programs

Some software, written 40 years ago in the C and Fortran family of low-level languages, controls oil production equipment. The failure of this equipment can be critical for production, so changing the software is highly undesirable. However, over the years, the source codes have been lost.

A new employee of the information security department, whose job was to understand how things work, noticed that the sensor control program writes something to disk at regular intervals, but it was not clear what it writes or how that information could be used. He also had the idea that equipment monitoring could be displayed on one large screen. To do this, it was necessary to figure out how the program works, what it writes to disk and in what format, and how this information can be interpreted.

To solve the problem, decompilation technology was used, followed by analysis of the recovered code. We first disassembled the software components one at a time, then localized the code responsible for the input/output of information, and from there gradually recovered the rest of the code, following its dependencies. The program's logic was then reconstructed, which made it possible to answer all the security service's questions about the analyzed software.

If you need to analyze a binary program in order to restore the logic of its operation, partially or completely restore the logic of converting input data into output data, etc., it is convenient to do this using a decompiler.

In addition to such tasks, in practice there are also tasks of analyzing binary programs against information security requirements. The customer does not always understand that this analysis is very labor-intensive. It might seem enough to decompile the program and run the resulting code through a static analyzer, but a high-quality analysis almost never comes out of this.

Firstly, vulnerabilities must not only be found but also explained. If a vulnerability is found in a program written in a high-level language, the analyst or the code analysis tool can show which code fragments contain the flaws that caused it. But what if there is no source code? How do you show which code caused the vulnerability?

The decompiler recovers code that is "littered" with recovery artifacts, and mapping the identified vulnerability onto such code is useless: nothing is clear from it anyway. Moreover, the recovered code is poorly structured and therefore does not lend itself well to code analysis tools. Explaining a vulnerability in terms of the binary program is also difficult, because the person receiving the explanation must be well versed in the binary representation of programs.

Secondly, binary analysis against information security requirements must be carried out with an understanding of what will be done with the result, because fixing a vulnerability in binary code is very difficult when there is no source code.

Despite all the peculiarities and difficulties of static analysis of binary programs against information security requirements, there are many situations when such analysis must be performed. If for some reason there is no source code and the binary program implements functionality that is critical from an information security standpoint, it must be checked. If vulnerabilities are discovered, such an application should be sent for rework where possible, or wrapped in an additional protective layer that controls the movement of sensitive information.

When the vulnerability was hidden in a binary file

If the code that a program executes is highly critical, it is useful to audit the binary file even when the source code in a high-level language is available. This helps to catch the quirks that the compiler may introduce through optimizing transformations. For example, in September 2017 an optimizing transformation performed by the Clang compiler was widely discussed: its result was a call to a function that should never be called.

#include <stdlib.h>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
    return system("rm -rf /");
}

void NeverCalled() {
    Do = EraseAll;
}

int main() {
    return Do();
}

As a result of the optimizing transformations, the compiler produces the following assembly code. The example was compiled under Linux x86 with the -O2 flag.

    .text
    .globl  NeverCalled
    .align  16, 0x90
    .type   NeverCalled,@function
NeverCalled:                            # @NeverCalled
    retl
.Lfunc_end0:
    .size   NeverCalled, .Lfunc_end0-NeverCalled

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    subl    $12, %esp
    movl    $.L.str, (%esp)
    calll   system
    addl    $12, %esp
    retl
.Lfunc_end1:
    .size   main, .Lfunc_end1-main

    .type   .L.str,@object              # @.str
    .section .rodata.str1.1,"aMS",@progbits,1
.L.str:
    .asciz  "rm -rf /"
    .size   .L.str, 9

The source code contains undefined behavior, and NeverCalled() takes effect because of the optimizing transformations the compiler performs. During optimization the compiler most likely performs alias analysis and concludes that the pointer Do can only hold the address assigned in NeverCalled(), that is, the address of EraseAll(). Since main() calls through Do without it ever being given a defined value, which is undefined behavior according to the standard, the following result is obtained: EraseAll() is called, which executes the "rm -rf /" command.
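One way to remove the undefined behavior is sketched below in a reduced form (EraseAll and NeverCalled are omitted for brevity, and DoNothing is a name invented for this sketch): giving the pointer a defined initial value deprives the compiler of the assumption that NeverCalled() must have run before the call.

typedef int (*Function)(void);

static int DoNothing(void) { return 0; }

static Function Do = DoNothing;   /* defined initial value removes the UB */

int main(void)
{
    return Do();                  /* now provably calls DoNothing() unless reassigned */
}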

In the following example, as a result of the compiler's optimizing transformations, the check of a pointer for NULL before its dereference is lost.

#include <cstdlib>
void Checker(int *P) {
    int deadVar = *P;
    if (P == 0) return;
    *P = 8;
}

Because line 3 dereferences the pointer, the compiler assumes that the pointer is non-null. Line 4 is then removed by the "unreachable code removal" optimization, since the comparison is considered redundant, and after that line 3 is removed by the "dead code elimination" optimization, because the value of deadVar is never used. Only line 5 remains. The assembly code produced by compiling with gcc 7.3 under Linux x86 with the -O2 flag is shown below.

    .text
    .p2align 4,15
    .globl  _Z7CheckerPi
    .type   _Z7CheckerPi, @function
_Z7CheckerPi:
    movl    4(%esp), %eax
    movl    $8, (%eax)
    ret

The compiler optimization examples above are the result of undefined behavior (UB) in the code. Yet this is perfectly normal-looking code that most programmers would assume is safe. Today programmers take the time to eliminate undefined behavior in a program, whereas 10 years ago they paid little attention to it. As a result, legacy code may contain UB-related vulnerabilities.
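For comparison, here is a sketch of the same function with the check moved before the first dereference: no undefined behavior remains, so the compiler has no licence to delete the check and the generated code keeps it.

void Checker(int *P)
{
    if (P == 0)
        return;
    *P = 8;     /* reached only when P is known to be non-null */
}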

Most modern static source code analyzers do not detect UB-related errors. Therefore, if code implements functionality that is critical from an information security standpoint, it is necessary to check both its source code and the executable code that will actually run.

Annotation

Currently, a large number of tools have been developed to automate the search for software vulnerabilities. This article will discuss some of them.

Introduction

Static code analysis is a software analysis that is performed on the source code of programs and is implemented without actually executing the program under study.

Software often contains vulnerabilities caused by errors in the program code. Errors made during development can, in some situations, cause the program to fail, disrupting its normal operation: data is often changed or corrupted, and the program or even the whole system may stop. Most vulnerabilities are related to incorrect processing of data received from outside, or to insufficiently strict validation of that data.

To identify vulnerabilities, various tools are used, for example, static analyzers of the source code of the program, an overview of which is given in this article.

Classification of security vulnerabilities

When the requirement that a program operate correctly on all possible input data is violated, so-called security vulnerabilities can appear. A security vulnerability can allow a single program to be used to overcome the security restrictions of the entire system.

Security vulnerabilities can be classified according to the underlying software error:

  1. Buffer overflow. This vulnerability arises from the lack of control over out-of-bounds array access during program execution. When a data packet that is too large overflows a buffer of limited size, the contents of adjacent memory cells are overwritten and the program may crash. Depending on where the buffer is located in process memory, buffer overflows are classified as stack buffer overflows, heap buffer overflows, and static data area (bss) buffer overflows (see the sketch after this list for an example).
  2. Tainted input vulnerabilities. These can arise when user input is passed without sufficient checking to an interpreter of some external language (usually a Unix shell or SQL). In that case the user can craft the input so that the launched interpreter executes a completely different command than the one intended by the authors of the vulnerable program.
  3. Format string vulnerability. This type of security vulnerability is a subclass of the tainted input vulnerability. It arises from insufficient control of parameters when using the format I/O functions printf, fprintf, scanf, etc. of the C standard library. These functions take as one of their parameters a character string specifying the input or output format of the subsequent arguments. If the user can supply the format string himself, this vulnerability can result from the unsafe use of the string formatting functions.
  4. Vulnerabilities caused by synchronization errors (race conditions). Problems associated with multitasking lead to situations called race conditions: a program not designed to run in a multitasking environment may assume that, for example, the files it uses while running cannot be changed by another program. As a result, an attacker who replaces the contents of these working files at the right moment can force the program to perform certain actions.
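The sketch below, written only for illustration, combines two of the classes above in a single C function; the safer variants are given in the comments.

#include <stdio.h>
#include <string.h>

void greet(const char *user_input)
{
    char name[16];

    strcpy(name, user_input);   /* stack buffer overflow if the input is 16 bytes or more;
                                   safer: strncpy(name, user_input, sizeof(name) - 1)      */
    printf(user_input);         /* format string vulnerability;
                                   safer: printf("%s", user_input)                          */
}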

Of course, in addition to those listed, there are other classes of security vulnerabilities.

Overview of existing analyzers

The following tools are used to detect security vulnerabilities in programs:

  • Dynamic debuggers. Tools that allow you to debug a program while it is running.
  • Static analyzers (static debuggers). Tools that use the information accumulated during the static analysis of the program.

Static analyzers indicate those places in the program where an error might be found. These suspicious code snippets can either contain a bug or be completely harmless.

This article provides an overview of several existing static analyzers. Let's take a closer look at each of them.

1. BOON

A tool that, based on deep semantic analysis, automates the process of scanning C source texts in search of vulnerabilities that can lead to buffer overflows. It detects possible defects by assuming that some values ​​are part of an implicit type with a specific buffer size.

2. cqual

An analysis tool for finding errors in C programs. It extends the C language with additional user-defined type qualifiers. The programmer annotates the program with the appropriate qualifiers, and cqual checks for errors; incorrect annotations indicate potential errors. cqual can be used to detect potential format string vulnerabilities.
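An illustrative example in the spirit of cqual is shown below; the qualifier syntax mentioned in the comment is approximate and is given only to convey the idea, while the C code itself is an ordinary program exhibiting the flagged pattern.

/* In cqual the declarations would carry qualifiers, roughly:
     char * $tainted getenv(const char *name);
     int printf(const char * $untainted fmt, ...);
   so the call below is reported as tainted data reaching an
   untainted format string. */
#include <stdio.h>
#include <stdlib.h>

void show_home(void)
{
    const char *home = getenv("HOME");   /* tainted data from the environment */
    if (home != NULL)
        printf(home);                    /* potential format string vulnerability */
}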

3. MOPS

MOPS (MOdel Checking Programs for Security) is a tool for finding security vulnerabilities in C programs. Its purpose is to check, by model checking, that a C program conforms to a static model of safe behavior. MOPS uses a software audit model to help determine whether a program conforms to a set of rules defined for creating secure software.

4. ITS4, RATS, PScan, Flawfinder

The following static analyzers are used to find buffer overflow errors and format string errors:

  1. ITS4. A simple tool that statically scans C/C++ source code for potential security vulnerabilities. It flags calls to potentially dangerous functions such as strcpy/memcpy, performs a superficial semantic analysis to estimate how dangerous such code is, and gives advice on how to improve it.
  2. RATS. The RATS (Rough Auditing Tool for Security) utility processes C/C++ code and can also process Perl, PHP, and Python scripts. RATS scans the source code for potentially dangerous function calls. The purpose of this tool is not to find bugs definitively, but to provide reasonable conclusions on the basis of which a specialist can manually verify the code. RATS combines checks of varying depth, from semantic checks like those in ITS4 to the deeper semantic analysis for buffer overflow defects borrowed from MOPS.
  3. PScan. Scans C source texts for potentially incorrect use of printf-like functions and finds format string vulnerabilities.
  4. Flawfinder. Like RATS, a static source code scanner for programs written in C/C++. It searches for functions that are most commonly misused, assigns risk scores to them (based on information such as the passed parameters), and compiles a list of potential vulnerabilities sorted by severity.

All these tools are similar in that they use only lexical analysis and simple parsing. Therefore, the results they produce may contain up to 100% false positives.
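A short example of why this happens: a purely lexical scanner flags every occurrence of a "dangerous" name, so both calls below produce warnings even though the first one is provably safe, which only a semantic, size-aware analysis can establish.

#include <string.h>

void demo(const char *user_input)
{
    char small[8];
    char big[64];

    strcpy(big, "hello");        /* flagged, but the literal always fits */
    strcpy(small, user_input);   /* flagged, and genuinely dangerous     */
}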

5. Bunch

A C program analysis and visualization tool that builds a dependency graph to help the auditor understand the modular structure of the program.

6. UNO

A simple source code analyzer. It was designed to catch errors such as uninitialized variables, null pointers, and out-of-bounds array errors. UNO allows you to perform a simple analysis of the control flow and data flows, perform both intra- and inter-procedural analysis, and specify user properties. But this tool has not been finalized for analyzing real applications, does not support many standard libraries, and at this stage of development does not allow analyzing any serious programs.

7. FlexeLint (PC-Lint)

This analyzer is designed to analyze the source code in order to detect various types of errors. The program performs a semantic analysis of the source code, analysis of data and control flows.

At the end of the work, messages of several main types are issued:

  • A null pointer is possible;
  • Memory allocation issues (e.g. no free() after malloc());
  • Problematic control flow (for example, unreachable code);
  • Possible buffer overflow, arithmetic overflow;
  • Warnings about bad and potentially dangerous code style.

8. Viva64

A tool that helps a specialist to trace potentially dangerous fragments in the source code of C/C++ programs related to the transition from 32-bit systems to 64-bit ones. Viva64 is integrated into the Microsoft Visual Studio 2005/2008 environment, which contributes to convenient work with this tool. The analyzer helps to write correct and optimized code for 64-bit systems.

9. Parasoft C++ Test

A specialized tool for Windows that automates quality analysis of C++ code. The C++Test package parses the project and generates code to test the components it contains. C++Test does a very important job of analyzing C++ classes. After the project is loaded, the test methods need to be configured. The software examines the arguments of each method and the types of the values returned. For simple types, default argument values are substituted; test data can be defined for user-defined types and classes. It is possible to override C++Test's default arguments and to specify the values returned from the test. Of particular note is C++Test's ability to test unfinished code: it generates stub code for any method or function that does not yet exist. Simulation of external devices and user-defined input is supported, and both features can be retested. Once test parameters are defined for all methods, the C++Test package is ready to run executable code. It generates test code by invoking the Visual C++ compiler to build it. Tests can be generated at the method, class, file, and project levels.

10. Coverity

The tools are used to identify and fix security and quality defects in critical applications. Coverity's technology removes the barriers to writing and deploying complex software by automating the detection and remediation of critical defects and security flaws during development. Coverity's tool can process tens of millions of lines of code with a minimal false positive rate, providing 100% trace coverage.

11. Klocwork K7

The company's products are designed for automated static code analysis, detection and prevention of software defects and security issues. The company's tools serve to identify the root causes of software quality and security deficiencies, and to track and prevent these deficiencies throughout the development process.

12. Frama-C

An open, integrated set of tools for analyzing C source code. The set includes ACSL (ANSI/ISO C Specification Language) - a special language that allows you to describe in detail the specifications of C functions, for example, specify the range of valid input values ​​of the function and the range of normal output values.
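As a brief illustration, here is a minimal sketch of an ACSL contract of the kind Frama-C consumes; the function increment() is just an example, and only standard ACSL clauses (requires, assigns, ensures) are used.

/*@ requires \valid(p);
  @ assigns  *p;
  @ ensures  *p == \old(*p) + 1;
  @*/
void increment(int *p)
{
    *p = *p + 1;
}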

This toolkit helps to perform the following actions:

  • Perform formal code checks;
  • Look for potential execution errors;
  • Conduct an audit or review of the code;
  • Conduct code reverse engineering to improve understanding of the structure;
  • Generate formal documentation.

13. CodeSurfer

A program analysis tool that is not directly designed to find security vulnerability bugs. Its main advantages are:

  • Pointer analysis;
  • Various data flow analyzes (use and definition of variables, data dependency, call graph construction);
  • Script language.

CodeSurfer can be used to find errors in source code, to improve understanding of source code, and to reverse engineer programs. Within the framework of the CodeSurfer environment, a prototype tool for detecting security vulnerabilities was developed, but the developed tool is used only within the developer organization.

14. FxCop

Provides a means to automatically validate .NET assemblies against the Microsoft .NET Framework Design Guidelines. The compiled code is checked using reflection mechanisms, MSIL parsing and call graph analysis. As a result, FxCop is able to detect over 200 flaws (or errors) in the following areas:

  • Library architecture;
  • Localization;
  • Naming rules;
  • Performance;
  • Security.

FxCop provides the ability to create your own rules using a special SDK. FxCop can work both in the graphical interface and on the command line.

15. JavaChecker

It is a static Java program analyzer based on TermWare technology.

This tool allows you to identify code defects, such as:

  • sloppy exception handling (empty catch blocks, generic exceptions, etc.);
  • hiding names (for example, when the name of a class member is the same as the name of a formal method parameter);
  • style violations (you can set the programming style using a set of regular expressions);
  • violations of standard usage contracts (for example, when the equals method is overridden, but not hashCode);
  • synchronization violations (for example, when access to a synchronized variable is outside a synchronized block).

The set of checks can be controlled using control comments.

JavaChecker can be called from an ANT script.

16. Simian

A similarity analyzer that looks for repeated syntax in several files at once. The program understands the syntax of various programming languages, including C#, T-SQL, JavaScript, and Visual Basic, and can also search for repeating fragments in plain text files. Many configuration options allow you to fine-tune the rules for finding duplicate code. For example, the threshold parameter determines how many repeated lines of code count as a duplicate.

Simian is a small tool designed to find code repetition efficiently. It has no graphical interface, but it can be run from the command line or accessed programmatically. The results are output as text and can be presented in one of the built-in formats (for example, XML). Although Simian's spartan interface and limited output options take some getting used to, they help keep the product simple and effective. Simian is suitable for finding duplicate code in both large and small projects.

Repetitive code reduces project maintainability and upgradability. You can use Simian to quickly find duplicate code snippets in many files at the same time. Since Simian can be run from the command line, it can be included in the build process to receive warnings or stop the process in case of code repetitions.

Conclusion

So, in this article, static source code analyzers were considered, which are auxiliary tools for a programmer. All tools are different and help track the most diverse classes of security vulnerabilities in programs. It can be concluded that static analyzers must be accurate and sensitive. But, unfortunately, static debugging tools cannot give an absolutely reliable result.


