Dr. Hadjicostis' Current Research Projects
___________________________________________________________________________________________

Diagnosis and Assessment of Faults, Misbehavior and Threats in Distributed Systems and Networks

Sponsored by NSFInformation Technology Research (ITR) initiative.

This project is a multi-university effort that aims at developing theory and techniques for monitoring and diagnosing faults, hazards or, more generally, functional changes in dynamic systems and networks, under limited and possibly corrupted information.  Its goal is to develop a unifying and multifaceted approach to this problem by decomposing the large body of fault diagnosis research into six topics:

  1. Deterministic fault diagnosis
  2. Model-based probabilistic diagnosis
  3. Adaptive and sequential diagnosis
  4. Distributed system-level diagnosis with communication constraints in wired/wireless networks
  5. Fault diagnosis via distributed belief propagation algorithms
  6. Model-independent diagnosis

Our research team, which involves researchers from Boston University, Massachusetts Institute of Technology, University of Illinois (lead), University of Oklahoma and Yale University, leverages its expertise in the areas of fault diagnosis, sequential detection, system-level diagnosis, distributed control, modeling,analysis and performance evaluation, applied probability, graph theory, belief propagation and model reduction to the problem of detecting, identifying and localizing faults and abnormalities in dynamically evolving environments.

This is a joint project with Profs. Carolyn Beck and R. Sreenivas at UIUC, Prof. Ioannis Paschalidis at Boston University,  Prof. Sekhar Tatikonda at Yale, Prof. K. Thulasiraman at the University of Oklahoma and Prof. John Tsitsiklis at MIT.

Details
___________________________________________________________________________________________

Diagnosis and Tolerance of Faults and Misbehavior in Distributed Systems via Structured Redundancy

Sponsored by Boeing, Trusted Software Center, Information Trust Institute.

This project investigates methodologies for detecting, locating and overcoming faults and misbehavior in networks and distributed systems.  Starting from a generally applicable probabilistic description of faults and their effects, the project develops and analyzes efficient and effective heuristics for detecting and identifying (isolating) faults.  Since fault detection/isolation in this context is generally NP-Complete, we focus on effective belief-propagation algorithms that have polynomial complexity and are amenable to distributed implementation.  These algorithms allow us to investigate the role of structured redundancy in the given distributed system and can be used in a variety of contexts, ranging from equipment diagnosis and medical diagnosis to multiple intrusion detection.  The project also studies state estimation in distributed systems and analyzes its implications to trust and privacy.  In particular, we consider notions of state opacity in distributed systems (i.e., the existence of a set of states that needs to be kept opaque --- secret --- from outside observers) and develop strategies to analyze and verify them, as well as supervisory control methodologies to enforce them while minimally restricting the system behavior.  Finally, we study systematic ways of overcoming faulty or malicious behavior when performing function calculation (including simple dissemination of information) in distributed systems.  More specifically, we analyze linear iterative strategies where each node updates at each time-step a local value to be a weighted average of its own previous local value and those of its neighbors.  Such strategies not only allow (after a sufficiently large number of time-steps) each node to obtain enough information to calculate the desired function of the initial node values, but are also robust to faults or misbehavior by nodes in the network.  We are currently evaluating potential applications of such strategies in distributed system operation and maintenance.

___________________________________________________________________________________________

HMM Classification and Biosequencing

Our earlier work on probabilistic failure diagnosis in finite state machines led us naturally to the problem of classification of hidden Markov models (HMMs).  This classification becomes more challenging when errors can corrupt the output sequence via label insertions, deletions, and transpositions.  We have been able to develop optimal classification algorithms that have polynomial complexity even in the presence of certain types of loops in the associated trellis diagram.  More importantly, by exploiting the structure of the underlying HMMs, we have been able to obtain bounds on the a priori probability of HMM misclassification, i.e., the probability that we observe a sequence that is more likely generated from a model other than the one that actually generated it.  Specifically, we have been able to show that, under certain (easy to check) conditions on the graphical structure of the HMMs, the bound on the probability of HMM misclassification goes down exponentially with the length of observations.  Since existing work for protein structure prediction relies very heavily on identifying homologous sequences with known structure to be used as templates, we are interested in applying our techniques in order to evaluate our ability to discriminate between two different HMM templates.  One question of particular importance is the computational complexity associated with the use of our methods in biosequencing applications.  More generally, within the context of this project, we are interested in exploiting the implications of our classification algorithms and bounds on problems of sequence similarity, homology and alignment using probabilistic models.

___________________________________________________________________________________________

Diagnosis using Belief Propagation Algorithms

This project investigates how belief propagation algorithms can be adopted for the problem of diagnosing multiple diseases/faults/abnormalities based on a set of observed symptoms or findings.  The work is motivated by abroad range of applications (such as network security, fault detection/isolation, and medical diagnosis).  The core problem is described by a weighted bipartite graph that consists of a set of components (or diseases), a set of alarms (or symptoms) and a set of dependencies between them.  The weighted connections represent the causal relationships from the diseases to the symptoms, and the goal is to find the combination of components that has the maximum a posteriori probability (MAP) based on the alarms (symptoms/findings) observed and the a priori probabilities of components, alarms and connection failures.  Since the problem is NP-complete, the project studies computationally efficient iterative algorithms that can efficiently provide good solutions.  In addition, by analyzing the structure of the underlying bipartite graph, analytical bounds on the performance of these algorithms are obtained (e.g., in terms of the probability of making erroneous diagnosis with respect to the MAP solution).  We are particularly interested in applying our algorithms to the QuickMedical Reference (QMR) database with the ultimate goal of providing decision-support for medical diagnosis.  We are also using theoretical machinery to obtain analytical bounds on the performance of these algorithms and characterize maximally discriminatory tests.

___________________________________________________________________________________________

An Integrated Approach to Fault Tolerance in Discrete-Time Dynamic Systems

Sponsored by NSFFaculty Early Career Development (CAREER) program.

As the complexity of dynamic systems and networks grows through the continuous deployment of embedded systems and the availability of novel sensor and actuator technologies, the likelihood of temporal or permanent failures at certain components or communication links of the system increases significantly and the consequences become highly unpredictable and severe.  Even within a single digital device, the reduction of voltages and capacitances, the shrinking of transistor sizes and the sheer number of gates involved has led to a significant increase in the frequency of so-called ``soft-errors,'' and has prompted leading semiconductor manufacturers to admit that they may be facing difficult challenges in the future.  The occurrence of failures becomes a major concern when the systems involved are life-critical (such as military, transportation or medical systems), or operate in remote or inaccessible environments (where repair may be difficult or even impossible).

This project aims at obtaining systematic approaches for modeling, detecting, identifying and correcting faults in order to ensure the proper functionality of discrete-time dynamic systems or networks.  Unlike traditional control where the goal is to stabilize a given dynamic system (while perhaps maintaining some sort of optimality in the applied control input), a fault-tolerant design aims at ensuring that any deviation from the expected system behavior is confined within a small time interval (usually one discrete-time step).  In addition, the designer of a fault-tolerant system needs to account for the possibility of failures in the sensors or communication links, or even in the error detecting/correcting mechanism itself.  This project takes a system-theoretic viewpoint towards the design of fault-tolerant dynamic systems;  the main goals are to obtain resource-efficient fault-tolerant implementations and to characterize their fundamental limitations by jointly exploiting system-, coding- and information-theoretic techniques.
___________________________________________________________________________________________

Enabling Novel Digital Sequential Circuit Designs through Error Control and Noise Tolerance Techniques

Sponsored by NSFInformation Technology Research (ITR) initiative.

This project aims at evaluating the practical implications of recently developed error control and noise tolerance techniques in the construction of reliable, high performance digital sequential circuits.  The main focus is to explore how dynamic error correction (DEC) and algorithm noise-tolerant (ANT) methodologies can enable next-generation sequential circuit architectures that are cost-effective and operate at speed and energy efficiencies that potentially exceed the limits imposed by current VLSI architectures.  The objective of this research is two-fold:

Joint project with Prof. Naresh Shanbhag at UIUC.
___________________________________________________________________________________________

Fault-Tolerant Operation and Control of Energy Processing Systems

Sponsored by NSF / ONR,  Electric Power Networks Efficiency and Security (EPNES) program.

The high availability of networking and digital technologies has opened up a number of exciting possibilities for building reliable energy processing systems and automated fault detection and accommodation mechanisms.  However, before traditional fault tolerance techniques (like modular redundancy and error-control coding) can proliferate in the context of energy processing systems, a number of questions need to be addressed.  The main issues pertain to the dynamics of the underlying energy processing system (including their coupling with the fault-tolerant procedures) and the reliability of the monitoring/correcting mechanisms (which can themselves malfunction due to a power failure).  The main goal of this research project is to develop a comprehensive framework for dynamical state estimation, fault detection and fault accommodation in energy processing systems, such as terrestrial and autonomous power systems, electric drives and power electronic systems, as found in both civilian and military sectors.  In particular, this project makes connections with traditional fault tolerance techniques by developing distributed monitoring/correcting schemes and by explicitly accounting for the system dynamics before overcoming faults that affect the functionality of the system.  The methods developed in this project can have direct impact on the design and control of naval power systems, which are persistently subjected to adverse activity, and on the operation of utility systems and electric drives in civilian applications, where disturbances can come from the natural environment, intentional attacks, and the demands that distributed generation and deregulation imposes on them.

Joint project with Prof. Alex Stankovic at Northeastern University.
___________________________________________________________________________________________

Architectures for Secure and Robust Distributed Infrastructures

Sponsored by AFOSR, University Research Initiative (URI).

This is large project that involves researchers from four different academic institutions (Caltech, MIT, Stanford and UIUC) with a variety of backgrounds and expertise.  The following excerpt, taken from the project's webpage, describes the main focus of this work:  "The major barrier constraining the successful management and design of large-scale distributed infrastructures is the conspicuous lack of knowledge about their dynamical features and behaviors.  Up until very recently analysis of systems such as the Internet, or the national air traffic system, have primarily relied on the use of non-dynamical models, which neglect their complex, and frequently subtle, inherent dynamical properties. These traditional approaches have enjoyed considerable success while systems are run in predominantly cooperative and ``friendly'' environments, and provided that their performance boundaries are not approached. With the current proliferation of applications using and relying on such infrastructures, these infrastructures are becoming increasingly stressed, and as a result the incentives for malicious attacks are heightening. The stunning fact is that the fundamental assumptions under which all significant large-scale distributed infrastructures have been constructed and analyzed no longer hold; the invalidity of these non-dynamical assumptions is witnessed with the greater frequency of catastrophic failures in major infrastructures such as the Internet, the power grid, the air traffic system, and national-scale telecommunication systems.''

Within the context of this project, Dr. Hadjicostis research focuses in addressing the challenges that arise in regards to distributed or hierarchical control and coordination, fault tolerance, safety and scalability.  The initial goal of the proposed research is to develop models and algorithms that are appropriate for evaluating the sensitivity of a complex interconnected system to failures or parameter perturbations.  A familiar example that illustrates the complexity of the issues involved is the commercial air traffic network:  flight and ground operations scheduling are performed at several, highly interacting levels, including the central flow management (FAA systems command center), en-route traffic control, local flow control, departure and arrival planning at the airports (TRACON and airport tower facilities), and individual airline constraints and ground personnel limitations (operations control centers).  Malicious attacks, accidental malfunctions, personnel shortage, or delays at different components of this system can have not only localized effects but can also manifest themselves as a cascading failure in the overall system.  The coupling between different airports and air traffic operations enables a single failure, perhaps as simple as a broken conveyor belt at one airport, to potentially trigger a complicated failure mode and result in a highly undesirable global behavior.  Another familiar example is the telephone network and its vulnerability to minor failures.  What is alarming in these large interconnected systems is that an intelligent attacker that is aware of the vulnerabilities of a certain military, transportation or other critical system may be able to cause severe economic damages or chaotic consequences through a relatively ``innocent'' attack on a single component of the system.  Current techniques rely heavily on costly simulations that provide little insight or intuition;  the goal of this part of the proposed research is to develop novel models and algorithms for analyzing the effects of failures or parameter perturbations in complex dynamic systems and networks.

Details
___________________________________________________________________________________________

Hierarchical and Reconfigurable Schemes for Distributed Control over Heterogeneous Networks

Sponsored by NSFInformation Technology Research (ITR) initiative.

The main topic of this research project is the problem of reliable control of geographically distributed complex real-time systems over a heterogeneous communication network.  In its most general form, the heterogeneous network can be viewed as the backbone for information exchange between sensors, computerized control sites and actuators.  Recent technological advances in terms of cost-effective special-purpose computing architectures and high accessibility of network connectivity open up exciting possibilities for computer-based control methodologies that can be applied either centrally or distributively/hierarchically depending on the underlying application and objective.  In the former case, a central control location uses the network to gather information from the various system sensors and to relay carefully computed actions to the actuators.  In the case of distributed/hierarchical control, the network is used to exchange information between sensors, actuators, and multiple control sites.  Each such processing site receives information from a (not necessarily exclusive or fixed) subset of sensors and is responsible for sending optimal control signals to the corresponding actuator(s).

Details
___________________________________________________________________________________________

Enhanced Equalization and Decoding for EDGE, 3G and Beyond

Sponsored by Motorola.

This project investigates problems in equalization and decoding for wireless communications as applicable to EDGE, 3G and future systems.  Specifically, the project investigates the applicability of BAD and turbo-linear equalization algorithms to 3G and EDGE-type systems for wireless channels.  The project also investigates space-time coding approaches for time-varying channels.  Within the context of this project, Dr. Hadjicostis research has been focusing on soft-decision decoding algorithms for linear block codes.

Joint project with Profs. Ralf Koetter and Andrew Singer
___________________________________________________________________________________________

Fault-Tolerant Dynamic Systems and Networks

A principal focus of Dr. Hadjicostis' research is in the area of fault tolerance in dynamic systems and networks.  The reliability and performance of such systems/networks are severely hindered by their evolutionary nature: a single undetectable failure (in a component, link or node) at any time step renders their future overall functionality useless.  As the complexity of modern systems and networks increases, the development of techniques for detecting, handling and correcting failures in such dynamic environments, perhaps by taking advantage of their structural dynamics and/or interconnection topology, becomes imperative.

Dr. Hadjicostis focuses in particular on the development of a set of criteria that can be used as guidelines for the construction of fault-tolerant dynamic systems and networks, as well as in the theoretical aspects and fundamental limits and tradeoffs involved.  One of the goals of Dr. Hadjicostis' research is to relax the traditional assumption that checking mechanisms be fault-free and to construct reliable dynamic systems (networks) out of unreliable components.  These unreliable components (gates, processing elements, links or nodes) could be produced using novel technologies (e.g., quantum systems) or current technologies (e.g., silicon based technology) with relaxed constraints.
___________________________________________________________________________________________

Robust Monitoring and Testing of Complex Digital Systems and Heterogeneous Networks

Another focus of Dr. Hadjicostis' research is the development of robust schemes for monitoring and testing complex networks and systems.  Effective schemes need to detect and identify a large class of abnormal conditions or failures, while avoiding excessive overhead and hardware cost.  Due to the complexity and heterogeneity of modern systems and networks, it is imperative that our monitoring schemes are able to overcome incomplete or erroneous information supplied by imperfect components.  This research has important consequences for the real-time analysis and control of communication, transportation, or other critical networks;  it also has important ramifications for digital system testing and verification.
___________________________________________________________________________________________

Coding in Computational, Communication and Signal Processing Systems

Dr. Hadjicostis work exploits the interrelations between computational parallelism, data redundancy, communication delay and reliability to develop resource-efficient signal processing algorithms and robust end-to-end network designs.  Jointly designing processors and channel encoders in order to reduce encoding time or hardware complexity is one interesting possibility.  More generally, in networks of interconnected processing elements, the encoding and decoding implications of an error correcting code with minimal communication requirements may add a disproportionate cost in hardware or time delay.  The vast deployment of distributed computing environments and high-speed data links greatly accentuates the challenge of better characterizing the tradeoffs that are involved between communication/encoding cost versus computational or signal processing simplicity.
___________________________________________________________________________________________

Algebraic System Analysis

Algebraic techniques have resulted in powerful insights and constructions for many practical problems.  Dr. Hadjicostis is researching applications of algebraic results to the study of concurrency and parallelism in systems and networks and in linear algebras and linear system theories over fields, rings and semirings.  Of particular interest are finite state machines, state-space systems with nonnegative realizations, and max-plus ``linear'' systems.
___________________________________________________________________________________________

Write to chadjic@uiuc.edu
___________________________________________________________________________________________