Intro

Education:

  • University of South Florida, Computer Science, Ph.D., 2016-2023
  • Tsinghua University, Computer Science, M.S., 2005
  • Tsinghua University, Computer Science, B.S., 2002

Certificates:

  • Certified Data Scientist, The Data Incubator, 2020
  • Google Analytics for Beginners, 2017
  • Advanced Google Analytics, 2017
  • MIT Big Data and Social Behavior, 2018
  • IBM Advanced Machine Learning and Signal Processing, 2023

Skills:

  • Programming Languages: Python, R, Java, C, VBA, SQL
  • Database: Oracle, Microsoft SQL Server, MySQL, Postgres
  • Information System: SAP, ArcGIS, Google Analytics
  • Data Analysis: Data Modeling, Machine Learning, Deep Learning, Data Quality Control, Data Visualization
  • Expertise: Text Mining, Image Processing, Social Network Analysis, Signal Processing

Professional Work:

  • Employer: National Center for Toxicological Research, January, 2006 – December, 2010
  • Employer: Soft Challenge LLC, Tampa, Florida, February, 2011 – present
  • Employer: Arkansas State University, September, 2023 – July, 2024
  • Employer: Boston University, August, 2024 – present

Internship Work:

  • Verinovum Inc, October, 2022 – March, 2023
    1. Data Cleaning and Formatting of OBX5 Segments in HL7 Format,
    2. Entity Resolution for Lab Tests, Medications, and Other Medical Terms in Doctor's Notes: a Featured Transformer Model (FTM) was used to perform classification and a Conditional Random Fields (CRF) model was used to extract entities, achieving 100% accuracy
  • Financial Information Technologies LLC, October, 2021 – August, 2022
    1. Entity Resolution for Packaging Styles, Brands, and Products in Billions of Invoices: a Conditional Random Fields (CRF) model was used to extract entities, achieving 100% accuracy,
    2. Product/Non-product Classification for Billions of Invoices: a Sentence-Transformer model was used to perform classification,
    3. Data Collection through Web Crawling and Web Scraping,
    4. Snowflake Data Warehouse Management,
  • TCM Bank LLC, February, 2020 – April, 2020
    1. Credit Risk Modeling and Analysis,
    2. Operational Risk Modeling and Analysis,
    3. Marketing Campaign Analysis,
    4. Customer Profiling,
    5. Credit Card Fraud Modeling and Analysis,
    6. Executive Monthly Financial Report Automation and Delivery.
  • Quality Counts LLC, March, 2019 – August, 2020
    1. Traffic Detection,
    2. Travel Planning,
    3. Congestion Detection,
    4. Curve Safety Analysis,
    5. Internal Business Tracking and Analysis with Google Analytics.
  • Geographic Solutions, January, 2018 – June, 2019
    1. Labor Market Insights Development with VB,
    2. Labor Market Insights Visualization,
    3. Key Performance Indicator (KPI) Definition and Evaluation,
    4. Location Based Job Market Analysis.

Awards:

  • Best Paper Award, at WMSCI 2016 Conference, for the paper titled “Combining Bayesian and Semantic Analysis with Domain Knowledge”
  • Best Paper Award, at WMSCI 2013, for the paper titled “Discovery of Strong Association Rules for Attributes from Data for Program of All-Inclusive Care for the Elderly (PACE)”
  • Best Poster Award, at National Science Foundation (NSF) Bioinformatics Workshop to Foster Collaborative Research 2013, for the poster titled “Linkage Discovery with Glossaries and Topic Segmentation”
  • Microsoft Innovation Cup Student Software Contest, 1st Place, 2005
  • Best Paper Award, at WMSCI 2011 Conference, for the paper titled “Statistical Quality Control of Microarray Gene Expression Data”
  • BEA Scholarship, 2005
  • Guanghua Scholarship, 2004
  • Tsinghua Scholarship, 2003

Projects

Image Processing

In image processing, we focus on edge detection and object detection. For edge detection, we use Conditional Random Fields (CRF) to improve existing edge detection methodology, detecting all edges as closed lines and without duplicate edges. For object detection, we add arbitrary features describing the objects to the deep learning model, forcing it to learn the characteristics of the specific objects.

Edge Detection

The challenge in line segment detection is that a gradient threshold has to be determined up front, a set of criteria is needed to define the properties of line segments, fractions of the image need to be selected for detection, and the validation of each line segment is also decided arbitrarily. We propose a semantic line segment detection model based on the Conditional Random Fields (CRF) model. Working directly from the image gradients, this methodology requires no prior knowledge about the image, no fraction selection, no threshold, and no criteria for region growing or rectangle validation.

Methodology

The CRF model, as shown in Figure 1, can be viewed as a Markov random field. The initial transition probability of each pixel is the same. During each iteration, the state function averages out the differences among neighbors, and the transition function updates the featured pixels. The state function makes local pixels similar to each other, which is especially useful when some pixels have been changed by noise or system errors: when the significance of each pixel is computed not only from the value of the pixel itself but also from the values of its adjacent neighbors, system errors and random noise are averaged out. The feature function is defined with the features we pick, so that it consistently increases the significance of feature pixels and suppresses the significance of non-feature pixels.

Figure 1. Conditional Random Fields Model
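The iterative state/feature updates described above can be sketched in a few lines of Python. This is only an illustration of the idea on image gradients, with simplified update rules and illustrative names, not the project's actual implementation.

  import numpy as np
  from scipy.ndimage import sobel, uniform_filter

  def crf_edge_map(image, iterations=10, alpha=0.5):
      # gradient magnitude gives the initial per-pixel significance
      gx, gy = sobel(image, axis=1), sobel(image, axis=0)
      significance = np.hypot(gx, gy)
      significance /= significance.max() + 1e-9

      for _ in range(iterations):
          # state function: smooth each pixel toward its 3x3 neighborhood,
          # averaging out noise and isolated system errors
          smoothed = uniform_filter(significance, size=3)
          # feature function: reinforce pixels that stand out from their
          # neighborhood (likely edge pixels) and suppress the rest
          boost = np.where(significance > smoothed, 1.0, -1.0)
          significance = np.clip(smoothed + alpha * boost * significance, 0.0, 1.0)
      return significance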

Performance

We tested the methodology with several images: an image of simple lines, an image of simple objects, and the portrait of a person. We used the same feature functions and the same Gaussian kernel to detect line segments in the three different images. In Figure 2(a)(b)(c), we compare our results with those generated by the Linear Time Line Segmentation Detection (LTLSD) method.

Figure 2(a)(b)(c). Comparison of CRF results with LTLSD results

In the LTLSD results, as shown in Figure 2(a), objects in the image are not closed. As shown in Figure 2(b), many line segments are doubled, which makes the image hard to read, and some lines that are clearly shadows are extracted as lines. As shown in Figure 2(c), many lines in the background are cut; around the face and the hat, some small circles and short lines are not detected, and some are not closed.


Object Detection

Deep learning is widely used in image recognition. For example, a deep learning model can recognize the ten handwritten digits in the MNIST data set. Deep learning does not require a pre-selected feature set for training, as shown in Figure 1. During training, the model repeatedly selects features, evaluates the quality of the feature set, and generates the output for the next layer. However, what exactly does the deep learning model learn from the training set?

Figure 1. Deep Learning Model

In image processing, a single image contains a great many features; it is information rich. Even one small detail is a combination of many pixels, which is far too many for a human to quantitatively check each computation unit, the individual pixel.

A deep learning model selects features from entire images. If there are common patterns that can differentiate the ten categories, those patterns are chosen as the feature set. Some features are related to the pictures as a whole, and some features are related to the objects; but if we do not force the deep learning model to learn from the objects, most of the features it learns are based on the pictures, which means the model does not really learn the objects. This is a major flaw in the process: if we change the background of the objects, the deep learning model can no longer recognize them.

Methodology

We add a feature function to the deep learning model so that it is forced to learn from the objects only, as shown in Figure 2.

Figure 2. Feature Enforced Deep Learning Model
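One way to read "adding arbitrary features" is to feed an extra object-feature channel (for example, an object mask) alongside the image so that the network sees the object cue on every forward pass. The PyTorch sketch below illustrates that idea under those assumptions; it is not the actual model behind Figure 2.

  import torch
  import torch.nn as nn

  class FeatureEnforcedNet(nn.Module):
      def __init__(self, num_classes=10):
          super().__init__()
          self.backbone = nn.Sequential(
              nn.Conv2d(4, 16, kernel_size=3, padding=1),  # 3 RGB channels + 1 feature channel
              nn.ReLU(),
              nn.MaxPool2d(2),
              nn.Conv2d(16, 32, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.AdaptiveAvgPool2d(1),
          )
          self.classifier = nn.Linear(32, num_classes)

      def forward(self, image, feature_map):
          # feature_map highlights object pixels; concatenating it pushes the
          # convolutional filters to learn from the object rather than the background
          x = torch.cat([image, feature_map], dim=1)
          x = self.backbone(x).flatten(1)
          return self.classifier(x)

  # usage: model(images, object_masks) with images (N, 3, H, W) and masks (N, 1, H, W)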

Performance

As an opportunity to use AI for good, the EY 2024 challenge is to use high-resolution datasets to model coastal vulnerability and assess tropical storm damage. These models will help communities develop strategies for mitigating and recovering from severe climate events.

Figure 3. EY 2024 Challenge: Tropical storm damage detection model


Social Graph Analysis

Technology Trends

Understanding the structure of a network is central to network analysis. We are interested in extracting valuable information from the network, such as trends, popularity, the dynamics of the trends, and the profiles of popular items.

This prediction model can be used for many different applications, such as:

  • Technology Trends: citation/patent network
  • Online market: co-purchase network
  • Social Media Trends: online news/live journal/media sharing network
  • Social Dynamics: friends network

Sample Data

This is a citation graph for high-energy physics research papers from 1991 to 2000, with a total of N = 29,555 papers and E = 352,807 citations.

Data Modeling

In terms of data modeling, we need to define the popular topics for each year from the citation data and find out how many of them are new topics. For each popular topic, we need to find the density of the topic, how the density changes over the ten years, the diameter of the topic, and how the diameter changes over the ten years. A sketch of these two measurements is shown below.
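For a given topic (a set of papers) in a given year, the two measurements can be computed directly with networkx. The sketch below assumes the topic's paper ids have already been identified; how they are identified is discussed next.

  import networkx as nx

  def topic_statistics(citation_graph, topic_papers):
      # induced subgraph of the papers that belong to one popular topic
      sub = citation_graph.subgraph(topic_papers).to_undirected()
      density = nx.density(sub)
      # diameter is only defined on a connected graph, so measure it on
      # the largest connected component of the topic
      largest = max(nx.connected_components(sub), key=len)
      diameter = nx.diameter(sub.subgraph(largest))
      return density, diameter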

Many features are unknown in the citation network. For example, we do not know the topics of the papers, and we do not know how to define popularity because we do not have a global picture of the citation network. We can start from any paper and search along its paths through the network, but whatever we find is highly dependent on where we started. The citation network is so large that it is not feasible to try every paper, and greedy search over the network is NP-hard, so it cannot be expected to finish in polynomial time.

Methodology

We decompose the network into two layers, as shown in Figure 1. We use the Kronecker graph in Figure 2(a) as the computation unit. The Kronecker graph is the smallest symmetric graph and also the smallest community in a social network. A symmetric graph has structural and mathematical properties that support both decomposition and reconstruction; a sketch of the reconstruction step follows the figures.

Figure 1. Network Decomposition

Figure 2(a). Kronecker Graph

Figure 2(b). Graph Join and Project
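As a reference point for the reconstruction side, repeated Kronecker products of an adjacency matrix grow a graph that preserves the self-similar structure of the seed. The sketch below uses a 2x2 all-ones seed as an illustrative stand-in for the smallest symmetric unit; the actual decomposition and join/project operations in Figures 1 and 2 are more involved.

  import numpy as np

  # 2x2 all-ones seed standing in for the smallest symmetric community
  seed = np.array([[1, 1],
                   [1, 1]])

  def kronecker_power(adj, k):
      # repeated Kronecker products preserve the structural self-similarity
      # that supports decomposition and reconstruction
      out = adj
      for _ in range(k - 1):
          out = np.kron(out, adj)
      return out

  big = kronecker_power(seed, 3)   # 8x8 adjacency matrix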

Results

For each year from 1991 to 2000, we summarized the number of nodes (papers and references), the number of edges (citations), the maximum number of times a paper was cited, and the average number of times a paper was cited, as shown in Figure 3.

Figure 3. Statistics of Co-citation from 1991 to 2000

Each co-citation defines a topic, so the frequent co-citations define the popular topics. The frequency of co-citations is highly skewed. We use Expectation Maximization (EM) to group co-citations into 3 clusters; the statistics of the popular topics are shown in Figure 4. From 1991 to 2000, the numbers of publications, citations, and co-citations all increase. From 1991 to 1995 there are fewer publications, citations, and co-citations, and after 1995 their growth is not as fast as in the years from 1991 to 1995. The number of topics increases from 1991 to 1994. After 1994, the year-to-year change in the number of topics is small; it goes up and down around an average of about 800 topics per year.

Figure 4. Statistics of Popular Topics from 1991 to 2000
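The EM grouping of co-citation frequencies into three clusters described above can be reproduced with scikit-learn's GaussianMixture, which is fitted by EM. The sketch below uses synthetic counts purely as a placeholder for the real per-year co-citation frequencies.

  import numpy as np
  from sklearn.mixture import GaussianMixture

  # placeholder for the real per-year co-citation frequencies
  rng = np.random.default_rng(0)
  cocitation_counts = rng.poisson(lam=3.0, size=5000).reshape(-1, 1).astype(float)

  # EM clustering into 3 groups; the highest-mean cluster defines the popular topics
  gmm = GaussianMixture(n_components=3, random_state=0).fit(cocitation_counts)
  labels = gmm.predict(cocitation_counts)
  popular_cluster = int(np.argmax(gmm.means_))
  popular_topics = np.flatnonzero(labels == popular_cluster)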

These ten-year research trends indicate that technology evolves every five years and can be completely renewed in ten years. When we work on frontier research, we need to trace back about five years; beyond five years, most of the research is about something else. The trends of new publications, citations, and co-citations are consistent and corroborate each other.

The number of topics increases from 1991 to 1994. In 1991, there were no papers published on the topics that are popular in 2000. From 1995 to 2000, the number of topics is almost stable: it goes up and down, but the changes are small. The number of topics is consistent with the numbers of publications, citations, and co-citations.

Figure 5. Citation Network on Popular Topics from 1992 to 2000


Text Mining

For text mining, we focus on text similarity through dimension reduction using Latent Semantic Analysis (LSA), text classification using a Featured Transformer Model (FTM) on top of Bidirectional Encoder Representations from Transformers (BERT), and entity extraction using a Conditional Random Fields (CRF) model.

  • Online News Popularity Prediction
  • Text Classification through Featured Transformer Model (FTM)
  • Entity Extraction through Conditional Random Fields model (CRF)

Online News Popularity Prediction

We have online news with popularity scores from three different platforms: Facebook, GooglePlus, and LinkedIn. We want to find out whether, for any news item, we can predict its popularity based on the popularity of the existing news. This prediction model can be used for many different applications, such as:

  • advertising
  • election campaign
  • posts recommendation
  • dynamic content management

Sample Data

  Dimension         | Sample 1 | Sample 2
  IDLink            | 88518 | 87218
  Title             | Israel denies permits to Gazans for Palestine Marathon | Local organizations join USDA for initiative to bolster region's ...
  Headline          | Israel denies permits to Gazans for Palestine Marathon. A Palestinian youth stands next to a national flag at the Palestinian side of Beit Hanoun | Local organizations join USDA for initiative to bolster region's economy. Story · Comments. Print: Create a hardcopy of this page; Font Size:
  Source            | The Daily Star | Bristol Herald Courier (press release) (blog)
  Topic             | palestine | economy
  PublishDate       | 3/31/2016 3:45:16 PM | 3/31/16 15:54
  SentimentTitle    | 0.100503782 | -0.180421959
  SentimentHeadline | 0.150351633 | 0.0875
  Facebook          | 24 | 12
  GooglePlus        | 5 | 8
  LinkedIn          | 2 | 0

Data Modeling

Based on this data, we can see that it is both a text mining task and a classification task: we need to understand the text and also use other features to label each data entry. The data model is built in the following steps (a sketch follows the list):

  • Build semantic space
  • Reduce dimensions with Latent Semantic Analysis
  • Generate sample set for given news
  • Remove Outliers
  • KNN Classification
  • Adjust parameters for better results
  • Interpret results
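A sketch of this pipeline, assuming a table like the sample data above; the LSA dimensionality, the number of similar items kept, and the use of KNN regression on the popularity score are illustrative choices, and outlier removal is omitted.

  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD
  from sklearn.metrics.pairwise import cosine_similarity
  from sklearn.neighbors import KNeighborsRegressor

  def predict_popularity(titles, scores, query_title, n_similar=200, k=5):
      # build the semantic space and reduce dimensions with LSA
      tfidf = TfidfVectorizer(stop_words="english")
      X = tfidf.fit_transform(list(titles) + [query_title])
      Z = TruncatedSVD(n_components=100, random_state=0).fit_transform(X)
      corpus, query = Z[:-1], Z[-1:]

      # generate a sample set specific to the given news item
      similarity = cosine_similarity(query, corpus).ravel()
      nearest = np.argsort(similarity)[-n_similar:]

      # KNN prediction on the item-specific sample set
      knn = KNeighborsRegressor(n_neighbors=k)
      knn.fit(corpus[nearest], np.asarray(scores)[nearest])
      return knn.predict(query)[0]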

For the text features, we have two candidate solutions:

  • Bag of words
  • Latent Semantic Analysis (LSA)

For popularity prediction, we choose Latent Semantic Analysis.

For the other features, such as Source, SentimentTitle, and SentimentHeadline, we can use them directly to make predictions. Before we choose a classification algorithm, we evaluate the significance of the features with information gain, as listed below:

  Website    | Source             | SentimentTitle      | SentimentHeadline
  Facebook   | 0.6427983828436333 | 0.682137527933072   | 0.6065400706568865
  GooglePlus | 0.4545358118900283 | 0.48596147409162366 | 0.4117329019171993
  LinkedIn   | 0.5177361250103715 | 0.5575869048549802  | 0.4672467847269911
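The information gain values above can be estimated with scikit-learn's mutual information utilities. The sketch below assumes the popularity score has been discretized into classes and uses the column names from the sample data; the binning is an assumption.

  import pandas as pd
  from sklearn.feature_selection import mutual_info_classif
  from sklearn.preprocessing import LabelEncoder

  def feature_information_gain(df, target="Facebook", bins=3):
      # discretize the popularity score into classes
      y = pd.qcut(df[target], q=bins, labels=False, duplicates="drop")
      X = pd.DataFrame({
          "Source": LabelEncoder().fit_transform(df["Source"]),
          "SentimentTitle": df["SentimentTitle"],
          "SentimentHeadline": df["SentimentHeadline"],
      })
      gains = mutual_info_classif(X, y, discrete_features=[True, False, False],
                                  random_state=0)
      return dict(zip(X.columns, gains))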
The information gain shows us that these features are not very strongly related to the prediction target. We do not have many features to choose from, and we do not have domain knowledge to add extra information to the feature set either, so we need to improve the quality of the raw data as much as possible.

Our solution is to use latent semantic analysis to make the text feature more meaningful and, in addition, to use only similar news items to predict the label of the new item. In other words, we create a specific data set for each news item we are going to classify. The data modeling process can be generalized, but each model we generate is specific to one particular news item. We do not create one general model for all news; instead, each model classifies its particular news item better.

Latent Semantic Analysis

Latent semantic analysis (LSA) is a technique for text processing. The idea behind it comes from psychology: people learn language through words rather than grammar. Think about how babies learn languages; they simply put words together, and it makes sense. However, while we can ignore grammar, we cannot ignore the context of the words. Each word can have multiple meanings in different contexts, and different words can have the same meaning in the same context. This is the most confusing part of text mining and almost makes the text impossible to process. LSA resolves much of this ambiguity in a straightforward way: it simplifies the content. The formula of LSA is listed below.

Figure 1. Latent Semantic Analysis

The matrix in the middle of the right-hand side of the equation represents the topics. We start with all of the text and end up with the text projected onto several topics, and those topics are related to the text content.
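In standard notation (a textbook formulation of LSA rather than a reproduction of the figure), the decomposition is the truncated singular value decomposition of the term-document matrix:

  X \approx U_k \, \Sigma_k \, V_k^{T}

Here X is the term-document matrix, U_k maps terms to the k topics, Sigma_k (the matrix in the middle) holds the topic strengths, and V_k^T maps the topics back to the documents.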

Performance

We tested the data modeling strategy three times with three different news items. The performance is listed below.

Experiment 1

News: Get a $50 Microsoft Store gift card with Xbox
Dataset size: 7942
After Sample Selection: 212
Features: Source, SentimentTitle, SentimentHeadline
Predict: Popularity

  Source     | Actual Score | Predicted Score | Average | Max  | Min
  Facebook   | 9            | 10              | 64.5    | 3832 | 0
  GooglePlus | 2            | 3               | 7.5     | 8186 | 0
  LinkedIn   | 1            | 1               | 22.7    | 438  | 0

Experiment 2

News: Windows 10 Mobile Build 10586.306 now being ...
Dataset size: 7942
After Sample Selection: 82
Features: Source, SentimentTitle, SentimentHeadline
Predict: Popularity

  Source     | Actual Score | Predicted Score | Average | Max  | Min
  Facebook   | 23           | 3               | 60.5    | 1055 | 0
  GooglePlus | 0            | 1               | 6.8     | 234  | 0
  LinkedIn   | 0            | 0               | 22.8    | 433  | 0

Experiment 3

News: Xbox One Backward Compatibility: Two new games ...
Dataset size: 7942
After Sample Selection: 87
Features: Source, SentimentTitle, SentimentHeadline
Predict: Popularity

  Source     | Actual Score | Predicted Score | Average | Max | Min
  Facebook   | 6            | 5               | 31.2    | 775 | 0
  GooglePlus | 0            | 0               | 5.4     | 181 | 0
  LinkedIn   | 0            | 5               | 24.7    | 622 | 0


Featured Transformer Model (FTM) for Text Classification

Natural language is the most common way to present information across different fields, perspectives, and circumstances, so text is a rich source of information. However, with the development of information technology, the amount of documentation has become impossible for humans to handle. We need computers to help automatically understand documents and group them into categories, topics, and subjects until the total amount of information is manageable for humans.

The attention mechanism is a major breakthrough in deep learning and has become an increasingly popular concept and a useful tool in developing deep learning models for Natural Language Processing (NLP), because it builds a shortcut through the context of the input text and promotes correlated words within a sentence. The importance vector in the attention layer can be updated through Markov-like updates and can be customized easily for classification. The prediction is made by estimating how strongly each word is correlated with, or attends to, the other words.

We propose a Featured Transformer Methodology (FTM) based on the attention mechanism. It can efficiently add domain knowledge to a deep learning architecture. The attention mechanism performs Markov-like updates; when the model converges, the associations between domain features and word embeddings can be extracted.

Sample Data

  Sample                              | Class
  WBC 8.5 x/uL (4.5-9.8)              | LAB
  RBC 4.5 x/uL (0.5-5.0)              | LAB
  MCV 0.5 x/uL (0.0-5.0)              | LAB
  tramadol Allergy Intermediate hives | DOC
  codeine Allergy Mild NAUSEA         | DOC

Methodology

Bidirectional Encoder Representations from Transformers (BERT) uses the Transformer network architecture. The attention layer plays the role of aligning the encoder and decoder layers, and the encoder and decoder layers are built without recurrent networks, which reduces computation cost. The attention layer can be updated through a Markov-like process, as shown in Figure 1.

Figure 1. Markov Chain Updates

Connections between two sequences are represented in a two-dimensional matrix in which each dimension is a sequence of tokens; the more often a connection appears, the larger the value. The attention mechanism is a shortcut over this Markov chain in which the attention layer has fewer heads than the previous layer. As shown in Figure 2, in the BERT network architecture the importance vector of the attention layer aligns the encoder to the decoder and strongly affects the final prediction. If we can properly adjust the importance vector, for example to promote positive patterns and suppress negative or random patterns, we can efficiently improve the model performance.
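For reference, the standard scaled dot-product attention used in Transformer layers is shown below; the softmax of QK^T is the two-dimensional connection matrix described above.

  \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V

Here Q, K, and V are the query, key, and value projections of the token embeddings, and d_k is the key dimension.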

We propose the Featured Transformer Methodology (FTM) to improve the performance of Transformer networks, especially the BERT architecture. The idea is to define patterns according to domain knowledge, add those patterns as arbitrary features to the input layer, and use the arbitrary features to promote the importance of the corresponding terms in the attention layer. A simplified sketch of the idea follows.
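The sketch below illustrates the general idea only: hand-written domain patterns (lab-test cues such as numeric values, unit-like tokens, and reference ranges) become per-token features that boost the corresponding entries of an attention-style importance vector. The pattern definitions, the boost factor, and the function names are assumptions, not the FTM implementation.

  import re
  import numpy as np

  # illustrative domain patterns for lab-test cues
  LAB_PATTERNS = [
      r"^\d+(\.\d+)?$",                     # numeric result, e.g. "8.5"
      r"^[a-z0-9]+/[a-z]+$",                # unit-like token, e.g. "x/uL", "MG/DL"
      r"^\(\d+(\.\d+)?-\d+(\.\d+)?\)$",     # reference range, e.g. "(4.5-9.8)"
  ]

  def domain_features(tokens):
      # 1.0 for tokens matching a lab pattern, 0.0 otherwise
      return np.array([
          1.0 if any(re.match(p, t, re.IGNORECASE) for p in LAB_PATTERNS) else 0.0
          for t in tokens
      ])

  def promote_attention(importance, tokens, boost=2.0):
      # promote the importance of tokens that carry domain features,
      # then re-normalize the importance vector
      adjusted = importance * (1.0 + boost * domain_features(tokens))
      return adjusted / adjusted.sum()

In FTM itself, as described above, the promotion is applied to the importance vector of the attention layer during training; this standalone function only illustrates the adjustment.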

Performance

The model was tested with 2871 samples, of which 788 were labeled 'lab' and 2083 were labeled 'doc'. As shown in the tables below, the performance measurements (precision, recall, and F1-score) indicate that the model can identify lab tests with 100% on all measurements. In comparison with the Conditional Random Fields (CRF) model, Bidirectional Encoder Representations from Transformers (BERT), Naïve Bayes Support Vector Machine (NBSVM), Logistic Regression (LOGREG), FASTTEXT, Standard Gated Recurrent Units (STANDARD GRU), and Bi-directional Gated Recurrent Units (BiGRU), the Featured Transformer Methodology (FTM) performs better on all measurements.

Table 1. Performance of FTM on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 1.00      | 1.00   | 1.00
  LAB          | 1.00      | 1.00   | 1.00
  Macro Avg    | 1.00      | 1.00   | 1.00
  Weighted Avg | 1.00      | 1.00   | 1.00

Table 2. Performance of CRF on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.927     | 0.989  | 0.957
  LAB          | 0.959     | 0.764  | 0.850
  Macro Avg    | 0.943     | 0.876  | 0.904
  Weighted Avg | 0.935     | 0.933  | 0.931

Table 3. Performance of BERT on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.99      | 0.99   | 0.99
  LAB          | 0.99      | 0.99   | 0.99
  Macro Avg    | 0.99      | 0.99   | 0.99
  Weighted Avg | 0.99      | 0.99   | 0.99

Table 4. Performance of NBSVM on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.92      | 0.98   | 0.95
  LAB          | 0.99      | 0.97   | 0.98
  Macro Avg    | 0.96      | 0.97   | 0.96
  Weighted Avg | 0.97      | 0.97   | 0.97

Table 5. Performance of LOGREG on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.92      | 0.96   | 0.94
  LAB          | 0.98      | 0.97   | 0.97
  Macro Avg    | 0.95      | 0.96   | 0.96
  Weighted Avg | 0.96      | 0.96   | 0.96

Table 6. Performance of FASTTEXT on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.97      | 0.97   | 0.97
  LAB          | 0.99      | 0.99   | 0.99
  Macro Avg    | 0.98      | 0.98   | 0.98
  Weighted Avg | 0.98      | 0.98   | 0.98

Table 7. Performance of STANDARD GRU on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.97      | 0.94   | 0.95
  LAB          | 0.98      | 0.99   | 0.98
  Macro Avg    | 0.97      | 0.96   | 0.97
  Weighted Avg | 0.97      | 0.97   | 0.97

Table 8. Performance of BiGRU on 2871 Samples

  Class        | Precision | Recall | F1-score
  DOC          | 0.98      | 0.98   | 0.98
  LAB          | 0.99      | 0.99   | 0.99
  Macro Avg    | 0.99      | 0.99   | 0.99
  Weighted Avg | 0.99      | 0.99   | 0.99


Entity Extraction through Conditional Random Fields model (CRF)

In the health care industry, lab test sections are normally written in free-text format. With the rapid expansion of health care claims, the demand for automatically extracting lab test information from raw messages creates an opportunity to train computers to automatically review, extract, and aggregate information into data products. Named Entity Recognition (NER) is one of the most widely used techniques for making terms meaningful and extracting them from text.

There are several challenges in lab entity extraction. Each test starts with a test name followed by quantities, units, references, and dates. Test names are pre-defined medical terms that can be stored in a domain dictionary, and units can also be treated as a set of pre-defined terms. The test results and references have pre-defined formats with minimum values, maximum values, separators, and parentheses, for example "(1.2-2.5)". The formats of the test dates are also pre-defined. Beyond the characteristics of each individual token, the format of the test section differs from other sections: each row is a lab test and each column is a measurement such as result, unit, or reference. The result and reference fields can have words in common, and the unit and abnormal-flag fields can have words in common, yet each field needs to be labeled separately, with all tests listed together on separate lines. We want to model these characteristics of the lab sections and extract lab entities from the doctor's notes. A sketch of the entity extraction setup is shown below.
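A sketch of such a tagger with sklearn-crfsuite, using simple per-token features (case, digits, slashes, parenthesized ranges, neighbors) on comma-separated lab lines like the sample data below; the feature set and label names are illustrative, not the production model.

  import sklearn_crfsuite

  def token_features(tokens, i):
      t = tokens[i]
      feats = {
          "bias": 1.0,
          "lower": t.lower(),
          "is_upper": t.isupper(),
          "has_digit": any(c.isdigit() for c in t),
          "has_slash": "/" in t,
          "is_range": t.startswith("(") and "-" in t and t.endswith(")"),
      }
      if i > 0:
          feats["prev_lower"] = tokens[i - 1].lower()
      if i < len(tokens) - 1:
          feats["next_lower"] = tokens[i + 1].lower()
      return feats

  def to_features(lines):
      return [[token_features(line, i) for i in range(len(line))] for line in lines]

  # tiny illustrative training pair taken from the sample data below
  X_train = [["RBC", "4.02", "10E6/UL", "(4.50-5.90)", "L"]]
  y_train = [["NAME", "VALUE", "UNIT", "RANGE", "FLAG"]]

  crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
  crf.fit(to_features(X_train), y_train)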

Sample Data

  Raw                                           | Lab Test Name      | Value    | Unit        | Reference Range | Abnormal Flag
  RBC, 4.02, 10E6/UL, (4.50-5.90), L            | RBC                | ['4.02'] | ['10E6/UL'] | ['(4.50-5.90)'] | ['L']
  GRAN, 0.1, 10E3/UL, (0.0-0.0), H              | GRAN               | ['0.1']  | ['10E3/UL'] | ['(0.0-0.0)']   | ['H']
  BUN, 22, MG/DL, (6-20), H                     | BUN                | ['22']   | ['MG/DL']   | ['(6-20)']      | ['H']
  GLUCOSE, 135, MG/DL, (70-110), H              | GLUCOSE            | ['135']  | ['MG/DL']   | ['(70-110)']    | ['H']
  C-REACTIVE, PROTEIN, 4.7, MG/DL, (0.5-1.0), H | C-REACTIVE PROTEIN | ['4.7']  | ['MG/DL']   | ['(0.5-1.0)']   | ['H']

Performance

  Evaluation      | Precision | Recall | F1-score | Support Samples
  LAB Test Name   | 1         | 1      | 1        | 27700
  Reference Range | 1         | 1      | 1        | 1874
  Unit            | 1         | 1      | 1        | 884
  Abnormal Flag   | 1         | 1      | 1        | 1915
  Value           | 1         | 1      | 1        | 9510
  Macro Avg       | 1         | 1      | 1        | 41883
  Weighted Avg    | 1         | 1      | 1        | 41883


Credit Card Fraud Detection

In the banking industry, several applications are performed on a daily basis to ensure the smooth functioning of the industry. Customer profiling, risk detection, and fraud detection are some of the critical applications that are performed by banks to manage their operations efficiently and provide a secure and reliable service to their customers.

Customer profiling involves collecting and analyzing data about customers to understand their needs and preferences. This information helps banks to develop personalized products and services that meet the specific needs of their customers.

Risk detection involves identifying potential risks that can affect the financial health of the bank. Banks use advanced analytical techniques to identify and monitor risks related to credit, market, liquidity, operational, and reputational risks. This helps them to take proactive measures to mitigate these risks and ensure the stability of their operations. Fraud detection involves identifying and preventing fraudulent activities that can harm the bank and its customers. Banks use sophisticated fraud detection techniques to identify patterns and anomalies in customer transactions that may indicate fraudulent activities. This helps them to prevent financial losses and protect the interests of their customers.

Credit card fraud detection (CCFD) is like looking for needles in a haystack. It requires finding, out of millions of daily transactions, which ones are fraudulent. Due to the ever-increasing amount of data, it is now almost impossible for a human specialist to detect meaningful patterns from transaction data. For this reason, the use of machine learning techniques is now widespread in the field of fraud detection, where information extraction from large datasets is required.

In the simplest form, a payment card transaction consists of any amount paid to a merchant by a customer at a certain time. For fraud detection, it is also generally assumed that the legitimacy of all transactions is known as either genuine or fraudulent. This is usually represented by a binary label, with a value of 0 for a genuine transaction, and a value of 1 for fraudulent transactions.

In credit card fraud detection, data typically consists of transaction data, collected, for example, by a payment processor or a bank. Transaction data can be divided into three groups:

  • Account-related features: They include for example the account number, the date of the account opening, the card limit, the card expiry date, etc.
  • Transaction-related features: They include for example the transaction reference number, the account number, the transaction amount, the terminal (i.e., POS) number, the transaction time, etc. From the terminal, one can also obtain an additional category of information: merchant-related features such as its category code (restaurant, supermarket, …) or its location.
  • Customer-related features: They include for example the customer number, the type of customer (low profile, high profile, …), etc.
The challenges in credit card fraud detection include class imbalance, concept drift, near real-time requirements, categorical features, sequential modeling, class overlap, performance measurement, and a lack of public datasets. We provide a solution to balance normal and fraud samples and evaluate it with Logistic Regression, a Decision Tree with a depth of two, a Decision Tree with unlimited depth, Random Forest, and XGBoost. A sketch of the balancing and evaluation steps follows.
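The sketch below uses random undersampling of the majority class as a stand-in for the project's balancing solution and a Random Forest as one of the evaluated models, reporting the first two metrics from the performance table below; the feature matrix X and label vector y are assumed to follow the sample data (TX_FRAUD as the label).

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import roc_auc_score, average_precision_score
  from sklearn.model_selection import train_test_split

  def undersample(X, y, ratio=1.0, seed=0):
      # keep all fraud rows and a matching number of randomly chosen normal rows
      rng = np.random.default_rng(seed)
      fraud = np.flatnonzero(y == 1)
      normal = np.flatnonzero(y == 0)
      keep = rng.choice(normal, size=int(len(fraud) * ratio), replace=False)
      idx = np.concatenate([fraud, keep])
      return X[idx], y[idx]

  def evaluate(X, y):
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
      X_bal, y_bal = undersample(X_tr, y_tr)
      model = RandomForestClassifier(n_estimators=100, random_state=0)
      model.fit(X_bal, y_bal)
      scores = model.predict_proba(X_te)[:, 1]
      return {"AUC ROC": roc_auc_score(y_te, scores),
              "Average Precision": average_precision_score(y_te, scores)}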

Sample Data

  Feature                             | Value
  TX_AMOUNT                           | 57.12
  TX_DURING_WEEKEND                   | 1.0
  TX_DURING_NIGHT                     | 1.0
  CUSTOMER_ID_NB_TX_1DAY_WINDOW       | 1.0
  CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW  | 57.12
  CUSTOMER_ID_NB_TX_7DAY_WINDOW       | 1.0
  CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW  | 57.12
  CUSTOMER_ID_NB_TX_30DAY_WINDOW      | 1.0
  CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW | 57.12
  TERMINAL_ID_NB_TX_1DAY_WINDOW       | 0.0
  TERMINAL_ID_RISK_1DAY_WINDOW        | 0.0
  TERMINAL_ID_NB_TX_7DAY_WINDOW       | 0
  TERMINAL_ID_RISK_7DAY_WINDOW        | 0.0
  TERMINAL_ID_NB_TX_30DAY_WINDOW      | 0.0
  TERMINAL_ID_RISK_30DAY_WINDOW       | 0.0
  TX_FRAUD                            | 1

Performance

  Model                              | AUC ROC | Average Precision | Card Precision@100
  Logistic Regression                | 0.644   | 0.126             | 0.181
  Decision Tree with Depth=2         | 0.829   | 0.246             | 0.215
  Decision Tree with Unlimited Depth | 0.915   | 0.576             | 0.222
  Random Forest                      | 0.991   | 0.944             | 0.223
  XGBoost                            | 0.949   | 0.810             | 0.134


About

Dr. Lucy Lu is a data scientist. Her research interests include Insights Analysis, Natural Language Processing, Entity Resolution, Deep Learning, Signal Processing, and Image Processing. She has been providing machine learning solutions across different industries, and her solutions have won three best paper awards. With her deep understanding of theory and her experience with technical details, her solutions achieve near-perfect results.

Letter of Recommendation


To whom it may concern,

I am pleased to write this enthusiastic letter of support for the application of Ms. Lucy Lu for a data scientist position in your organization. Lucy Lu holds a Ph.D. in Computer Science from the University of South Florida (USF), where, from 2020 to 2023, I served as Co-Major Professor on her dissertation committee for the thesis titled "Graph Analysis on Social Networks".

I have known Lucy Lu for 12 years and have collaborated with her on numerous research projects. Our collaborative work has resulted in 22 publications, with Lucy Lu as the lead author of 15 of these, and includes six journal articles, two book chapters, and eight conference proceedings, as well as several poster presentations at professional meetings. The areas of our research include data modeling and applications of data mining, text mining, and deep learning to problems that include medical record linkage, image processing, social network analysis, bioinformatics, and others.

Lucy Lu has demonstrated a very strong ability to work closely with highly skilled interdisciplinary teams of modelers and empiricists across multiple institutions.

We have co-authored two book chapters in Encyclopedias that pertain to “Linkage in Medical Records and Bioinformatics Data” and “Linkage Discovery with Glossaries”. I am providing below this letter a List of Publications co-authored with Lucy Lu.

Lucy Lu and I also worked collaboratively with Professor Les Piegl in the Department of Engineering and Computer Science at the University of South Florida (USF) on graph sampling based on both primary graphs and graph evolution.

Three of the conference proceedings on which Lucy Lu was lead author were selected for "Session Best Paper Awards" at the 2016, 2013, and 2011 World Multi-conference on Systemics, Cybernetics and Informatics (WMSCI). One of these, titled "Statistical Quality Control of Microarray Gene Expression Data", was also published in the Journal of Systemics, Cybernetics and Informatics (JSCI), which, according to Cabell's Directory of Publishing Opportunities, has an acceptance rate of 20%.

Lucy Lu and I worked for about a year on a manuscript that was published in 2013 in the International Journal of Information and Decision Sciences (IJIDS) that according to the Cabell’s Directory of Publishing Opportunities has an acceptance rate of 20% to 30% and is titled “Linkage in Medical Records and Bioinformatics Data”.

Lucy Lu and I have also presented posters at the P3 Annual Meetings, titled "Data Quality Control and Segmentation for Bioinformatics" in 2011 and "E-Book Reader: a Software Tool for Knowledge Discovery from Bioinformatics Proceedings" in 2012.

I highly recommend Lucy Lu for a data scientist position in your organization. If you have additional questions, please feel free to contact me at phones of 870-972-3989 (Office) and 870-761-8923 (Cell) or by e-mail at rsegall@astate.edu.

Sincerely,
Dr. Richard Segall, Professor