Ebook: Software Similarity and Classification
Author: Silvio Cesare Yang Xiang (auth.)
- Genre: Computers
- Tags: Systems and Data Security, Legal Aspects of Computing
- Series: SpringerBriefs in Computer Science
- Year: 2012
- Publisher: Springer-Verlag London
- Edition: 1
- Language: English
- pdf
Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book.
Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book. Table of Contents Cover Software Similarity and Classification ISBN 9781447129080 eISBN 9781447129097 Preface Acknowledgments Contents 1 Introduction Abstract 1.1 Background 1.2 Applications of Software Similarity and Classification 1.3 Motivation 1.4 Problem Formulization 1.5 Problem Overview 1.6 Aims and Scope 1.7 Book Organization References 2 Taxonomy of Program Features Abstract 2.1 Syntactic Features 2.1.1 Raw Code 2.1.2 Abstract Syntax Trees 2.1.3 Variables 2.1.4 Pointers 2.1.5 Instructions o 2.1.5.1 Assembly o 2.1.5.2 Intermediate Representations 2.1.6 Basic Blocks 2.1.7 Procedures 2.1.8 Control Flow Graphs 2.1.9 Call Graphs 2.1.10 Object Inheritances and Dependencies 2.2 Semantic Features 2.2.1 API Calls 2.2.2 Data Flow 2.2.3 Procedure Dependence Graphs 2.2.4 System Dependence Graph 2.3 Taxonomy of Features in Program Binaries 2.3.1 Object File Formats 2.3.2 Headers 2.3.3 Object Code 2.3.4 Symbols 2.3.5 Debugging Information 2.3.6 Relocations 2.3.7 Dynamic Linking Information 2.4 Case Studies 2.4.1 Portable Executable 2.4.2 Executable and Linking Format 2.4.3 Java Class File References 3 Program Transformations and Obfuscations Abstract 3.1 Compiler Optimisation and Recompilation 3.1.1 Instruction Reordering 3.1.2 Loop Invariant Code Motion 3.1.3 Code Fusion 3.1.4 Function Inlining 3.1.5 Loop Unrolling 3.1.6 Branch/Loop Inversion 3.1.7 Strength Reduction 3.1.8 Algebraic Identities 3.1.9 Register Reassignment 3.2 Program Obfuscation 3.3 Plagiarism, Software Theft, and Derivative Works 3.3.1 Semantic Changes 3.3.2 Code Insertion 3.3.3 Code Deletion 3.3.4 Code Substitution 3.3.5 Code Transposition 3.4 Malware Packing, Polymorphism, and Metamorphism 3.4.1 Dead Code Insertion 3.4.2 Instruction Substitution 3.4.3 Variable Renaming 3.4.4 Code Reordering 3.4.5 Branch Obfuscation 3.4.6 Branch Inversion and Flipping 3.4.7 Opaque Predicate Insertion 3.4.8 Malware Obfuscation Using Code Packing 3.4.9 Traditional Code Packing 3.4.10 Shifting Decode Frame 3.4.11 Instruction Virtualization and Malware Emulators 3.5 Features under Program Transformations References 4 Formal Methods of Program Analysis Abstract 4.1 Static Feature Extraction 4.2 Formal Syntax and Lexical Analysis 4.3 Parsing 4.4 Intermediate Representations 4.4.1 Intermediate Code Generation 4.4.2 Abstract Machines 4.4.3 Basic Blocks 4.4.4 Control Flow Graph 4.4.5 Call Graph 4.5 Formal Semantics of Programming Languages 4.5.1 Operational Semantics 4.5.2 Denotational Semantics 4.5.3 Axiomatic Semantics 4.6 Theorem Proving 4.6.1 Hoare Logic 4.6.2 Predicate Transformer Semantics 4.6.3 Symbolic Execution 4.7 Model Checking 4.8 Data Flow Analysis 4.8.1 Partially Ordered Sets 4.8.2 Lattices 4.8.3 Monotone Functions and Fixed Points 4.8.4 Fixed Point Solutions to Monotone Functions 4.8.5 Dataflow Equations 4.8.6 Dataflow Analysis Examples 4.8.7 Reaching Definitions 4.8.8 Live Variables 4.8.9 Available Expressions 4.8.10 Very Busy Expressions 4.8.11 Classification of Dataflow Analyses 4.9 Abstract Interpretation 4.9.1 Widening and Narrowing 4.10 Intermediate Code Optimisation 4.11 Research Opportunities References 5 Static Analysis of Binaries Abstract 5.1 Disassembly 5.2 Intermediate Code Generation 5.3 Procedure Identification 5.4 Procedure Disassembly 5.5 Control Flow Analysis, Deobfuscation and Reconstruction 5.6 Pointer Analysis 5.7 Decompilation of Binaries 5.7.1 Condition Code Elimination 5.7.2 Stack Variable Reconstruction 5.7.3 Preserved Register Detection 5.7.4 Procedure Parameter Reconstruction 5.7.5 Reconstruction of Structured Control Flow 5.7.6 Type Reconstruction 5.8 Obfuscation and Limits to Static Analysis 5.9 Research Opportunities References 6 Dynamic Analysis Abstract 6.1 Relationship to Static Analysis 6.2 Environments 6.3 Debugging 6.4 Hooking 6.5 Dynamic Binary Instrumentation 6.6 Virtualization 6.7 Application Level Emulation 6.8 Whole System Emulation References 7 Feature Extraction Abstract 7.1 Processing Program Features 7.2 Strings 7.3 Vectors 7.4 Sets 7.5 Sets of Vectors 7.6 Trees 7.7 Graphs 7.8 Embeddings 7.9 Kernels 7.10 Research Opportunities References 8 Software Birthmark Similarity Abstract 8.1 Distance Metrics 8.2 String Similarity 8.2.1 Levenshtein Distance 8.2.2 Smith-Waterman Algorithm 8.2.3 Longest Common Subsequence (LCS) 8.2.4 Normalized Compression Distance 8.3 Vector Similarity 8.3.1 Euclidean Distance 8.3.2 Manhattan Distance 8.3.3 Cosine Similarity 8.4 Set Similarity 8.4.1 Dice Coefficient 8.4.2 Jaccard Index 8.4.3 Jaccard Distance 8.4.4 Containment 8.4.5 Overlap Coefficient 8.4.6 Tversky Index 8.5 Set of Vectors Similarity 8.6 Tree Similarity 8.7 Graph Similarity 8.7.1 Graph Isomorphism 8.7.2 Graph Edit Distance 8.7.3 Maximum Common Subgraph References 9 Software Similarity Searching and Classification Abstract 9.1 Instance-Based Learning and Nearest Neighbour 9.1.1 k Nearest Neighbours Query 9.1.2 Range Query 9.1.3 Metric Trees 9.1.4 Locality Sensitive Hashing 9.1.5 Distributed Similarity Search 9.2 Statistical Machine Learning 9.2.1 Vector Space Models 9.2.2 Kernel Methods 9.3 Research Opportunities References 10 Applications Abstract 10.1 Malware Classification 10.1.1 Raw Code 10.1.2 Instructions 10.1.3 Basic Blocks 10.1.4 API Calls 10.1.5 Control Flow and Data Flow 10.1.6 Data Flow 10.1.7 Call Graph 10.1.8 Control Flow Graphs 10.2 Software Theft Detection (Static Approaches) 10.2.1 Instructions 10.2.2 Control Flow 10.2.3 API Calls 10.2.4 Object Dependencies 10.3 Software Theft Detection (Dynamic Approaches) 10.3.1 Instructions 10.3.2 Control Flow 10.3.3 API Calls 10.3.4 Dependence Graphs 10.4 Plagiarism Detection 10.4.1 Raw Code and Tokens 10.4.2 Parse Trees 10.4.3 Program Dependency Graph 10.5 Code Clone Detection 10.5.1 Raw Code and Tokens 10.5.2 Abstract Syntax Tree 10.5.3 Program Dependency Graph 10.6 Critical Analysis References 11 Future Trends and Conclusion Abstract 11.1 Future Trends 11.2 Conclusion