Apply machine learning and deep learning to real-world problems.
Extract domain knowledge from textual content in scholarly papers.
Build focused search engine systems to index and discover semantic-level information.
Develop tools to process and present multi-modality scholarly big data at large scale.
Utilizing the established infrastructure, this project aims to create a sustainable CiteSeerX system with new data resources and a much larger data collection. We will develop a new system that runs with low operation overhead, and that provides quality and enriched data and metadata in portable formats that will be available through accessible user interfaces. We will ingest all freely accessible scientific documents on the Web, currently estimated to be 30 million. CiteSeerX will make available high-quality metadata through an accessible Web User Interface, Application Programming Interface, and data dumps. This project is supported by the National Science Foundation. Key people include PI Dr. C. Lee Giles (PSU) and Co-PI Dr. Jian Wu.
ODU's role is to co-direct graduate students and postdoctoral scholars in designing essential components of the infrastructure, architecture, data acquisition, extraction, ingestion, cleansing, and indexing.
We will bring computational access to book-length documents, through a research and piloting effort employing Electronic Theses and Dissertations (ETDs). The library and archives fields lack research on extracting and analyzing segments of long documents (chapters, reference lists, tables, figures), as well as methods for summarizing individual chapters of longer texts to enable findability. The project brings cutting-edge CS and machine learning technologies to advance discovery, use, and potential for reuse of the knowledge hidden in the text of books and book-length documents. By focusing on libraries' ETD collections, the research will enhance ETD programs, devising effective and efficient methods for opening the knowledge currently hidden in the rich body of graduate research and scholarship. This project is supported by the Institute of Museum and Library Services (IMLS). This project is a joint effort between Virginia Tech and ODU, directed by PI Bill Ingram, Co-PI Dr. Edward A. Fox, and Co-PI Dr. Jian Wu.
The ODU team is currently responsible for extracting metadata and full text from scanned ETDs using OCR techniques and then segmenting the full text into chapters and sections. The ODU team will also extract semantic information such as concepts and their definitions.
This project studies R&R (repeatability and reproducibility) of experiments in academic papers published in the social sciences by researching and developing systems and methods for assigning confidence scores to specific findings published in the social science literature. The final products include a prototypical instantiation of the proposed system that functions within the CiteSeerX framework and maintains explainability of its assertions. This project is supported by the Defense Advanced Research Projects Agency (DARPA). It is a collaborative effort of Pennsylvania State University, Texas A&M University, Microsoft Research, and ODU. Key people include Dr. C. Lee Giles (PSU), Dr. Sarah Rajtmajer (PSU), Dr. Chris Griffin (PSU), Dr. Anna Squicciarini (PSU), James Caverlee (TAMU), Xia (Ben) Hu (TAMU), Dr. Frank Shipman (TAMU), and David Pennock (Microsoft).
The ODU team works with the PSU team to perform information extraction from scholarly papers, including but not limited to headers, citations, acknowledgements, domain knowledge entities, and math expressions, and to integrate the extractors into the PDFMEF framework, which will be part of the final system.
The goal of this project is to automate the understanding of technical content contained in scientific images. In particular, the goal is to track the spread of technical information by finding copies and modified copies of technical diagrams in patent databases, as well as to label electronic components within tomography images. These two applications share the property that shape and topology within the image are the most important features. Computer vision, especially through the use of machine learning methods, has dramatically improved the ability to detect objects in images and semantically segment images to automate labelling of pixels within an image. However, these advances have not yet automated the understanding of information contained in hand-drawn figures, technical diagrams, and imagery produced for scientific inquiry. The key innovation is the insight that these technical images carry little per-pixel information compared with natural images (photographs and video), and that context, topology, and shape provide information. By representing images as hierarchical graphs, with annotations on topological relationships, the project will model the context and knowledge necessary to perform semantic-level analysis of images. Key people include Dr. Diane Oyen (LANL), Dr. C. Lee Giles (PSU), and Dr. Jian Wu (ODU).
Currently, the ODU team is investigating state-of-the-art image retrieval techniques, focusing on image meta search (text-based search), and probing the feasibility of applying them to technical images and diagrams (as opposed to natural images). This project is supported by the Department of Energy (DoE) through the Los Alamos National Laboratory (LANL).
This class will introduce the process of writing interactive web applications accessible through the WWW. Students will develop in the LAMP environment with ElasticSearch as the search platform. Emphasis will be on the integration of these components into a useful application: a search engine based on either semistructured or unstructured documents. Lectures will provide an overview of various concepts, and the class will be centered around the development of a semester-long project. Prerequisites include Web familiarity, programming knowledge, and database and search experience. The course will give best-practice instruction and guidance in developing a website with search functions using a LAMP stack, HTML, JavaScript, PHP, and MySQL, along with other more modern technologies, languages, and systems. The course will require students to use Git for version control and GitHub for project submission.
This course was offered by me at ODU CS in Fall 2021 (syllabus), Spring 2021, Fall 2020, and Fall 2019.
This course aims to prepare Computer Science and Cybersecurity students for obtaining a fundamental understanding of the relational database concepts and practical skills to analyze and implement a well-defined database design. In particular, CS450/550 provides an introduction to physical database design, data modeling, relational model, logical database design, SQL query language, and instructors’ choices on database applications and advanced concepts. Students will learn to use a real-world open-source database management system. Upon taking CS450/550, students should be able to understand the implications and future directions of databases and database technologies.
This course is offered by me at ODU CS in Spring 2022 (syllabus) and Spring 2021.
This class will explore the theory and engineering of information retrieval in the context of developing web-based search engines. The course will explore topics related to crawling, ranking, query processing, retrieval models, evaluation, clustering, and other aspects related to building search engines. The course will also cover recently established ranking algorithms that incorporate semantic similarities, machine learning, and neural network methods, such as learning to rank and neural information retrieval. The class will feature several hands-on development and coding exercises using tools such as Google Custom Search and ElasticSearch, as well as a theoretical exploration of the existing literature on these topics. An external speaker will also be invited to give a talk on contemporary search engines and related topics (depending on availability). Students must be comfortable with self-directed learning appropriate for an advanced graduate class.
This course was offered by me at ODU CS in Fall 2021 (syllabus) and Spring 2019.
One of the computer science subject areas most impacted by artificial intelligence in the last decade is natural language processing (NLP). This technology has further advanced the ability of machines to read, understand, and write textual content.
This seminar is designed to use textual content in scientific documents as an example to train graduate students in effective and efficient ways to process text and extract statistical, syntactical, and semantic features from free text. The other half of the seminar will cover contemporary research topics in scholarly big data, an instance of big data, and more broadly text mining. The course will introduce commonly used machine learning (ML), NLP, and information retrieval (IR) tools as preparation for a course project.
This course was offered by me at ODU CS in Fall 2020 (syllabus) and Fall 2018.
Over the past two decades, with the advent and prevalence of GPUs and the more recently adopted TPUs, deep learning has made revolutionary advances, achieving remarkable progress on tasks in traditional natural language processing (NLP) and computer vision (CV). Against this background, a new subject field called natural language understanding (NLU) has emerged and received much attention from both academic and industrial researchers. The core task of NLU is to tackle the fundamental challenges of training and testing computer algorithms that effectively and efficiently represent human language with data structures that are processable by computers, and of building artificial intelligence (AI) systems that mimic the human ability to interpret and generate human languages.
The subject covers many emerging research topics. Some have made substantial progress over the past decade (such as building pre-trained language models) and some are still challenging (such as automatically generating coherent abstract summaries). This topical course is designed for graduate students to learn fundamental concepts and algorithms of deep learning and to explore important research topics in NLU, including contextual representation models, grounded language understanding, natural language inference, supervised sentiment analysis, neural information retrieval, relation extraction with distant supervision, and semantic parsing. The course will also introduce representative benchmark datasets and evaluation metrics.
This course is offered by me at ODU CS in Spring 2022 (syllabus).
As database management software becomes one of the critical components in modern IT applications and systems, a solid understanding of the fundamentals of data design and management is required for virtually any IT professional. In a business setting, such IT professionals should be able to talk to clients to derive the right requirements for database applications, ask the right questions about the nature of the entities and the relationships among them in their business scenarios, analyze and develop an effective and robust design to address business constraints, and adapt existing database designs as new needs arise. A solid understanding of the underlying data models and design issues in data applications is also critical for SRA (Security and Risk Analysis) and cybersecurity students to ensure secure access to and intelligent analysis of data in complex business settings. Modern IT professionals should be able to guide a company in the best use of the diverse database-related technologies and applications for the “Big Data” era.
As such, IST 210 aims to prepare students for obtaining a fundamental understanding of the concepts and practical skills to analyze and implement a well-defined relational database design. In particular, IST 210 provides an introduction to physical database design, data modeling, relational model, logical database design, SQL query language, and instructors’ choices on database applications and advanced concepts. Students will learn to use a real-world commercial or open-source database management system, too. Upon taking IST 210, students should be able to understand the implications and future directions of databases and database technologies.
This course was offered by me at Penn State IST in Spring 2017, Fall 2017, and Spring 2018.
This is a first course in programming principles for application development. The course will focus on application development foundations including: fundamental programming concepts; basic data types and data structures; problem solving using programming; basic testing and debugging; basic computer organization and architecture; and fundamentals of operating systems. This is a hands-on course designed to help students learn to program a practical application using modern, high-level languages.
This course was offered by me at Penn State IST in Fall 2017 and Spring 2018.
This course is intended to prepare students to understand, design, develop and use information retrieval and search systems. The course will cover: organization, representation, and access to information; categorization, indexing, and content analysis; data structures for unstructured data; design and maintenance of such data structures, indexing and indexes, retrieval and classification schemes; use of codes, formats, and standards; analysis, construction and evaluation of search and navigation techniques; and search engines and how they relate to the above. Students will build a specialty web search engine using open source web tools and focused web crawling.
I co-taught this course with Dr. C. Lee Giles at Penn State IST in Spring 2015, Spring 2016, and Spring 2017. The official course page of IST441 is here.
Since the beginning of the 21st century, computer and information science has witnessed rapid and unprecedented advances in artificial intelligence (AI), represented by the prosperity of machine learning and deep learning algorithms. However, many of these algorithms and models are limited to lab experiments. In real-world problems, data are often noisy, contaminated, and deficient. Usually, a single model is not sufficient to meet specific requirements, which calls for systems consisting of multiple components.
The mission of the lab is to apply machine learning and deep learning techniques to real-world problems, focusing on building systems to solve multidisciplinary problems using building blocks in natural language processing, scholarly big data, digital libraries, and information retrieval.
The lab logo is a lighthouse drawn by a child. We are viewing the world with our naked eyes like little children, attempting to represent it using sketchy strokes and simple colors.
Muntabir started working with Dr. Wu in the fall of 2019. He obtained his Bachelor's degree from Elizabethtown College in Pennsylvania and then worked as an engineer at Resource9 Inc. in New York City. Muntabir is pursuing a PhD degree and is a graduate research assistant. He works on a project in collaboration with Virginia Tech to mine information from scanned electronic theses and dissertations (ETDs).
Xin is a PhD student in computer science. She started working with Dr. Wu in summer 2020. Previously she worked with Dr. Cong Wang on cybersecurity. Xin's research focuses on extracting semantic information from scientific papers. She has participated in and then led the information extraction effort for the SCORE project and the semantic information extraction from US design patents.
Kehinde (Kenny) started his PhD with Dr. Wu in Spring 2021. His research focused on accurately extracting data from scientific tables. Kenny participated in the project to build a large-scale patent image dataset. He interned at Microsoft as a Data Scientist in summer 2022.
Lamia started her PhD in Spring 2021. Her research is focused on investigating computational reproducibility of research papers. Lamia also participated in a project to improve the metadata quality of electronic theses and dissertations.
Pei started working with Dr. Wu in the spring of 2020. He obtained his Bachelor's degree from Beihang University in Beijing, China and then worked as a commentator for VSPN. He worked with Dr. Wu as a graduate research assistant for one year on acknowledgement extraction and then transferred to Virginia Tech. After obtaining a master's degree, he was hired by Microsoft as a Data Scientist.
I am actively recruiting undergraduate and graduate students to join my lab. Below are open positions.
Task Description:
The proliferation of disinformation in scientific domains has become a growing concern in recent years, with increasing attention given to the development of automated approaches for scientific claim verification, such as verifying whether mosquitos transmit coronavirus. Various efforts have been made to fact-check scientific claims, with a recent focus on automating the process through the use of machine learning and natural language processing techniques. Existing research methods have in common that they leverage pretrained language models to embed claims and evidence documents. Recently, large language models such as the GPT family have exhibited superior performance in reading comprehension and question-answering tasks. But whether GPT has reached or exceeded human capability to discern scientific disinformation is an open question. Our research will answer this question by collecting data from college students in various majors. Dr. Jian Wu, assistant professor of Computer Science, in collaboration with his student Stefania Dzhaman at Lehigh University, is recruiting undergraduate students for a paid task to label the truthfulness and rationale of scientific claims. The annotations will be done online using a web portal that has been developed for this task. All questions are multiple choice. Participants will watch a short training video (less than 3 minutes) before they can start working.
Requirements:
Compensation:
Participants will be compensated with an $80 Amazon Gift Card, sent by email.
Tentative Schedule:
Participants will receive online training by watching a short video after receiving the recruitment confirmation email. After that, the annotation will start. The total amount of time required to finish the task is about 5 to 7.5 hours.
Application:
To apply for this task, please fill out the Application form. The application will close when the desired number is reached. For any questions, please contact Dr. Jian Wu (j1wu@odu.edu).
The LAMP-SYS lab at ODU is recruiting motivated undergraduate students enrolled in the Computer Science program to participate in research projects about CiteSeerX under the mentorship of Dr. Jian Wu. CiteSeerX is a digital library search engine providing over 10 million academic documents online. One project (PDFMEF) will develop scalable and customizable information extraction software that can process millions of PDF documents in a timely manner. The other project (Online Voting) involves frontend design to facilitate evaluation of multiple keyphrase extraction models with crowdsourcing. Either project will last for the summer and the fall semesters. The basic requirements of qualified candidates include:
Basic requirements:
The LAMP-SYS Lab in the Computer Science Department at Old Dominion University in Norfolk, VA, USA is recruiting a fully supported PhD student to conduct research on Applied Machine Learning and Natural Language Processing Systems. The student will work with Dr. Jian Wu, assistant professor of Computer Science, on mining scholarly big data and digital libraries. The project will leverage cutting-edge technologies in machine learning, deep learning, natural language processing, and big data for information extraction, classification, and retrieval from scholarly big data corpora, including but not limited to electronic theses and dissertations (ETDs), research papers, news articles, and Wikipedia articles. Specific tasks include but are not limited to typed-entity and relation extraction, citation graph generation and analysis, building search engine systems, developing multiclass and multilabel classification models, and applying word embeddings to text retrieval and summarization tasks. The lab directed by Dr. Jian Wu will closely collaborate with the DLRL group at Virginia Tech and the CiteSeerX group at the Pennsylvania State University in the form of data and software sharing and online meetings.
The requirements include the following:
To be considered for this position, please email Dr. Jian Wu (jwu@cs.odu.edu) the following materials:
Please note that submitting the above documents does not constitute a full application for admission. The applicants may be asked to provide additional documents and materials required by the ODU graduate school (see below) if they are encouraged to apply. Recruitment will close when the position is filled.
Dr. Jian Wu is an assistant professor in the Computer Science Department at Old Dominion University (ODU), Norfolk, Virginia, United States. He is the tech leader of the CiteSeerX project, directed by Dr. C. Lee Giles. He is a member of the Web Science and Digital Libraries Research Group (WS-DL). He directs the Lab for Applied Machine Learning and Natural Language Processing Systems (LAMP-SYS) at ODU. Before joining ODU, Dr. Jian Wu was an assistant teaching professor in the College of Information Sciences and Technology (IST) at the Pennsylvania State University.
Dr. Jian Wu received his bachelor's degree in 2004 from the University of Science and Technology of China (USTC) in Physics and Astronomy. He obtained his Ph.D. degree from the Department of Astronomy and Astrophysics at the Pennsylvania State University in August 2011. After that, he joined the CiteSeerX team led by Dr. C. Lee Giles. Jian Wu is the tech leader of the CiteSeerX project. He led a small team to scale the CiteSeerX collection from 3 million to 10 million academic documents from 2015 to 2018. Dr. Jian Wu is the Co-PI of an NSF-supported project to build a scalable and sustainable CiteSeerX to support scholarly big data in the long term. As of October 2020, Dr. Jian Wu has published 48 peer-reviewed papers in ACM, IEEE, and AAAI conferences, journals, and magazines, including a best paper award and nominations. Dr. Jian Wu also published 7 journal articles in astronomical journals in his early career.
Dr. Jian Wu's collaborators include but are not limited to Dr. Michael Nelson, Dr. Michele Weigle, and Dr. Sampath Jayarathna at ODU CS, the CiteSeerX team directed by Dr. C. Lee Giles at the Pennsylvania State University, the Digital Library Research Laboratory directed by Dr. Ed Fox at Virginia Tech, Dr. Diane Oyen's group at the Los Alamos National Laboratory (LANL), the Document and Pattern Recognition Lab directed by Dr. Richard Zanibbi at Rochester Institute of Technology (RIT), the University of Illinois at Chicago (UIC), and the National University of Singapore (NUS). Dr. Wu's research is supported by the NSF, IMLS, DARPA, and DoE.
Dr. Wu's curriculum vitae can be downloaded here.
ORCID: 0000-0003-0173-4463.
Office: 3202 ECSB, Old Dominion University, Norfolk, VA, 23529.
Office phone: +1(757)683-7753.
Email: jwu at cs dot odu dot edu.